Methods and compositions for sequencing modified nucleic acids

ABSTRACT

Methods, compositions, and systems are provided for characterization of modified nucleic acids. In certain preferred embodiments, single molecule sequencing methods are provided for identification of modified nucleotides within nucleic acid sequences. Modifications detectable by the methods provided herein include chemically modified bases, enzymatically modified bases, abasic sites, non-natural bases, secondary structures, and agents bound to a template nucleic acid.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 13/836,806 filed Mar. 15, 2013, which claims the benefit of U.S. Provisional Application No. 61/617,999, filed Mar. 30, 2012, which is incorporated herein by reference. This application is further related to U.S. Provisional Application Nos. 61/721,206 and 61/721,339, both filed Nov. 1, 2012, U.S. Provisional Application No. 61/789,354 filed Mar. 15, 2013 and U.S. Provisional Application No. 61/799,237 filed Mar. 15, 2013, all of which are incorporated herein by reference in their entireties for all purposes.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH

Not Applicable.

BACKGROUND OF THE INVENTION

Assays for analysis of biological processes are exploited for a variety of desired applications. For example, monitoring the activity of key biological pathways can lead to a better understanding of the functioning of those systems as well as those factors that might disrupt the proper functioning of those systems. In fact, various different disease states caused by operation or disruption of specific biological pathways are the focus of much medical research. By understanding these pathways, one can model approaches for affecting them to prevent the onset of the disease or mitigate its effects once manifested.

A stereotypical example of the exploitation of biological process monitoring is in the area of pharmaceutical research and development. In particular, therapeutically relevant biological pathways, or individual steps or subsets of individual steps in those pathways, are often reproduced or modeled in in vitro systems to facilitate analysis. By observing the progress of these steps or whole pathways in the presence and absence of potential therapeutic compositions, e.g., pharmaceutical compounds or other materials, one can identify the ability of those compositions to affect the in vitro system, and potentially beneficially affect an organism in which the pathway is functioning in a detrimental way. By way of specific example, reversible methylation of the 5 position of cytosine by methyltransferases is one of the most widely studied epigenetic modifications. In mammals, 5-methylcytosine (5-MeC) frequently occurs at CpG dinucleotides, which often cluster in regions called CpG islands that are at or near transcription start sites. Methylation of cytosine in CpG islands can interfere with transcription factor binding and is associated with transcription repression and gene regulation. In addition, DNA methylation is known to be essential for mammalian development and has been associated with cancer and other disease processes. Epigenetic enhancer patterns have been identified in colon cancer cell lines, and a 5-hydroxymethylcytosine epigenetic marker has been identified in certain cell types in the brain, suggesting that it plays a role in epigenetic control of neuronal function (Akhtar-Zaidi, et al. (2012) Science 336(6082):736-739; and S. Kriaucionis, et al., Science 2009, 324(5929): 929-30, incorporated herein by reference in their entireties for all purposes). Further information on cytosine methylation and its impact on gene regulation, development, and disease processes is provided in the art, e.g., in A. Bird, Genes Dev 2002, 16, 6; M. Gardiner-Garden, et al., J Mol Biol 1987, 196, 261; S. Saxonov, et al., Proc Natl Acad Sci USA 2006, 103, 1412; R. Jaenisch, et al., Nat Genet 2003, 33 Suppl, 245; E. Li, et al., Cell 1992, 69, 915; A. Razin, et al., Hum Mol Genet 1995, 4 Spec No, 1751; P. A. Jones, et al., Nat Rev Genet 2002, 3, 415; P. A. Jones, et al., Nat Genet 1999, 21, 163; and K. D. Robertson, Nat Rev Genet 2005, 6, 597, all of which are incorporated herein by reference in their entireties for all purposes. Further, a large number of other nucleotide modifications are known in the art that play biological roles in some capacity, and these include, without limitation, N⁶-methyladenosine, N³-methyladenosine, N⁷-methylguanosine, pseudouridine, thiouridine, isoguanosine, isocytosine, dihydrouridine, queuosine, wyosine, inosine, triazole, diaminopurine, and 2′-O-methyl derivatives of adenosine, cytidine, guanosine, and uridine.

In contrast to determining a human genome, mapping of the human methylome is a more complex task because the methylation status differs between tissue types, changes with age, and is altered by environmental factors (P. A. Jones, et al., Cancer Res 2005, 65, 11241, incorporated herein by reference in its entirety for all purposes). Comprehensive, high-resolution determination of genome-wide methylation patterns from a given sample has been challenging due to the sample preparation demands and short read lengths characteristic of current DNA sequencing technologies (K. R. Pomraning, et al., Methods 2009, 47, 142, incorporated herein by reference in its entirety for all purposes).

Bisulfite sequencing is the current method of choice for single-nucleotide resolution methylation profiling (S. Beck, et al., Trends Genet 2008, 24, 231; and S. J. Cokus, et al., Nature 2008, 452, 215, the disclosures of which are incorporated herein by reference in their entireties for all purposes). Treatment of DNA with bisulfite converts unmethylated cytosine, but not 5-MeC, to uracil (M. Frommer, et al., Proc Natl Acad Sci USA 1992, 89, 1827, incorporated herein by reference in its entirety for all purposes). The DNA is then amplified (which converts all uracils into thymines) and subsequently analyzed with various methods, including microarray-based techniques (R. S. Gitan, et al., Genome Res 2002, 12, 158, incorporated herein by reference in its entirety for all purposes) or 2nd-generation sequencing (K. H. Taylor, et al., Cancer Res 2007, 67, 8511; and R. Lister, et al., Cell 2008, 133, 523, both incorporated herein by reference in their entireties for all purposes). While bisulfite-based techniques have greatly advanced the analysis of methylated DNA, they also have several drawbacks. First, bisulfite sequencing requires a significant amount of sample preparation time (K. R. Pomraning, et al., supra). Second, the harsh reaction conditions necessary for complete conversion of unmethylated cytosine to uracil lead to degradation of DNA (C. Grunau, et al., Nucleic Acids Res 2001, 29, E65, incorporated herein by reference in its entirety for all purposes), and thus necessitate large starting amounts of the sample, which can be problematic for some applications.
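The readout logic of bisulfite sequencing described above can be illustrated with a minimal in silico sketch. It assumes an idealized, complete conversion with no DNA degradation and a read perfectly aligned to its reference; the function names `bisulfite_convert` and `call_methylated` are hypothetical and introduced here only for illustration.

```python
def bisulfite_convert(seq, methylated_positions):
    """Return the sequence expected after bisulfite treatment and amplification.

    Idealized assumption: every unmethylated C is converted to U and then read
    as T after amplification, while every 5-MeC is protected and stays C.
    """
    out = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated_positions:
            out.append("T")  # unmethylated C -> U -> T
        else:
            out.append(base)  # 5-MeC and all non-C bases are unchanged
    return "".join(out)


def call_methylated(reference, converted):
    """Positions where the reference has C and the converted read still has C."""
    return [
        i
        for i, (ref_base, read_base) in enumerate(zip(reference, converted))
        if ref_base == "C" and read_base == "C"
    ]


reference = "ACGTCCGA"
converted = bisulfite_convert(reference, methylated_positions={1})
methylated = call_methylated(reference, converted)
```

Note that the converted read has lost most of its cytosines, which is one way to see the reduced sequence complexity that such methods introduce.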

Furthermore, because bisulfite sequencing relies on either microarray or 2nd-generation DNA sequencing technologies for its readout of methylation status, it also suffers from the same limitations as do these methodologies. For array-based procedures, the reduction in sequence complexity caused by bisulfite conversion makes it difficult to design enough unique probes for genome-wide profiling (S. Beck, et al., supra). Most 2nd-generation DNA sequencing techniques employ short reads and thus have difficulties aligning to highly repetitive genomic regions (K. R. Pomraning, et al., supra). This is especially problematic, since many CpG islands reside in such regions. Given these limitations, bisulfite sequencing is also not well suited for de novo methylation profiling (S. Beck, et al., supra).

In another widely used technique, methylated DNA immunoprecipitation (MeDIP), an antibody against 5-MeC is used to enrich for methylated DNA sequences (M. Weber, et al., Nat Genet 2005, 37, 853, incorporated herein by reference in its entirety for all purposes). MeDIP has many advantageous attributes for genome-wide assessment of methylation status, but it does not offer as high base resolution as bisulfite treatment-based methods. In addition, it is also hampered by the same limitations of current microarray and 2nd-generation sequencing technologies.

Research efforts aimed at increasing our understanding of the human methylome would benefit greatly from the development of a new methylation profiling technology that does not suffer from the limitations described above. Further, additional modifications are known to occur in human genetic material that are not detectable by the methods described above, e.g., hydroxymethylcytosine bases. Accordingly, there exists a need for improved techniques for detection of modifications in nucleic acid sequences, and particularly nucleic acid methylation.

Further, DNA is under constant stress from both endogenous and exogenous sources and is vulnerable to chemical modifications through different types of damage, including oxidation, alkylation, radiation damage, and hydrolysis. DNA base modifications resulting from these types of DNA damage are widespread and play important roles in physiological pathways and disease phenotypes (see, e.g., Geacintov, et al. (2010) The Chemical Biology of DNA Damage, Wiley-VCH Verlag GmbH & Co. KGaA; Kelley, M R (2011) DNA Repair in Cancer Therapy: Molecular Targets and Clinical Applications, Elsevier Science; and Preston, et al. (2011) Semin. Cancer Biol. 20:281-293, the disclosures of which are incorporated herein by reference in their entireties for all purposes). Examples include 8-oxoguanine, 8-oxoadenine (oxidative damage; aging, Alzheimer's, Parkinson's), 1-methyladenine, 6-O-methylguanine (alkylation; gliomas and colorectal carcinomas), benzo[a]pyrene diol epoxide (BPDE), pyrimidine dimers (adduct formation; smoking, industrial chemical exposure, UV light exposure; lung and skin cancer), and 5-hydroxycytosine, 5-hydroxyuracil, 5-hydroxymethyluracil, and thymine glycol (ionizing radiation damage; chronic inflammatory diseases, prostate, breast and colorectal cancer). Currently, these and other products of DNA damage are detected using bulk measurements including chromatographic techniques, polymerase chain reaction assays, the Comet assay, mass spectrometry, electrochemistry, radioactive labeling and immunochemical methods (see, e.g., Kumari, et al. (2008) EXCLI J. 7:44-62, incorporated herein by reference in its entirety for all purposes). Sequencing individual DNA molecules would be beneficial for mapping base damage, which can occur at random DNA template positions.

Typically, modeled biological systems rely on bulk reactions that ascertain general trends of biological reactions and provide indications of how such bulk systems react to different effectors. While such systems are useful as models of bulk reactions in vivo, a substantial amount of information is lost in the averaging of these bulk reaction results. In particular, the activity of and effects on individual molecular complexes cannot generally be teased out of such bulk data collection strategies.

Single-molecule real-time analysis of nucleic acid synthesis has been shown to provide powerful advantages over nucleic acid synthesis monitoring that is commonly exploited in sequencing processes. In particular, by concurrently monitoring the synthesis process of nucleic acid polymerases as they work in replicating nucleic acids, one gains advantages of a system that has been perfected over millions of years of evolution. Specifically, the natural DNA synthesis processes provide the ability to replicate whole genomes in extremely short periods of time, and do so with an extremely high level of fidelity to the underlying template being replicated.

BRIEF SUMMARY OF THE INVENTION

The present invention is generally directed to the combined analysis of epigenetic modifications and genomic sequences. In preferred embodiments, a single analytical reaction provides both base sequence data and modification data for a single nucleic acid template molecule. For example, a real time single molecule sequencing method can be used to detect modified nucleic acid sequences, e.g., methylated bases, within nucleic acid sequences. The present invention is expected to have a major impact on research aiming to illuminate the role of nucleic acid modifications, e.g., epigenetic modifications, in the study of genetics, genomics, and human health.

Allele calling in DNA sequencing is affected by binomial sampling statistics, which complicates the distinction between errors and heterozygote genotypes at low coverage. This challenge is even greater in polyploid organisms. Methods herein facilitate identification of a chromosomal source of a sequence read based on characteristic base modifications. In certain aspects, the invention provides methods for identifying a chromosomal source of a sequencing read, and then grouping that data with other sequencing reads from the same source prior to further statistical analysis, e.g., consensus sequence determination. In certain embodiments, differential modification of homologous chromosomes is detected kinetically during a sequencing reaction, and the resulting differential modification data is used to identify the chromosomal source of the sequencing read corresponding to the kinetic data.
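The effect of binomial sampling at low coverage can be made concrete with a short worked calculation. The sketch below assumes a diploid heterozygous site at which each read samples either allele with probability 0.5 and ignores sequencing error; the function name `prob_allele_seen_at_most` is hypothetical.

```python
from math import comb


def prob_allele_seen_at_most(n_reads, k, p=0.5):
    """Probability that a given allele appears in at most k of n_reads reads.

    Assumes independent reads, each drawing that allele with probability p
    (0.5 for an unbiased diploid heterozygote); sequencing error is ignored.
    """
    return sum(
        comb(n_reads, i) * p**i * (1 - p) ** (n_reads - i) for i in range(k + 1)
    )


# At 5x coverage, one particular allele is missed entirely about 3% of the time.
missed_at_5x = prob_allele_seen_at_most(5, 0)

# At 10x coverage, one particular allele is seen no more than once about 1% of
# the time, a count that is hard to tell apart from a sequencing error.
singleton_at_10x = prob_allele_seen_at_most(10, 1)
```

Under these assumptions, grouping reads by chromosomal source before consensus calling removes this ambiguity, because each group is then expected to carry a single allele.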

In certain aspects of the invention, a method of sequencing homologous chromosomes is provided that comprises: a) providing a pair of homologous chromosomes having a locus of interest, the pair comprising a first homolog and a second homolog, wherein the first homolog comprises a modified base at the locus of interest and the second homolog lacks the modified base at the locus of interest; b) sequencing the first homolog and the second homolog, wherein each nucleotide sequenced is subjected to an interrogation that provides both base composition data and kinetic data, thereby generating sequence reads for the locus of interest from both the first homolog and the second homolog, the sequence reads comprising both said base composition data and said kinetic data; c) analyzing the base composition data and the kinetic data for said locus of interest to identify a first subset of the sequence reads that comprise the modification, and assigning the first subset to the first homolog; and d) analyzing the base composition data and the modification data for said locus of interest to identify a second subset of the sequence reads that lack the modification, and assigning the second subset to the second homolog. In some embodiments, the modified base is a methylated base. In preferred embodiments, the locus of interest from either or both homologs is not bisulfite-converted, amplified, and/or cloned. Single-molecule sequencing is a preferred sequencing technology, e.g., sequencing-by-incorporation, tSMS sequencing, and nanopore sequencing, and both homologs can be sequenced in the same reaction mixture. In some applications of the method, the homologs are X chromosomes from an individual. The locus of interest is optionally within a highly repetitive region or an imprinted region. Alternatively or in addition, the locus of interest is at least one of a drug target, a locus associated with a genetic disorder, or a forensic marker.
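Steps c) and d) above amount to partitioning reads by whether their kinetic data at the locus of interest indicate the modified base. The following is a minimal sketch only: the read representation (a dict with an "id" and a per-position "kinetics" map), the single numeric kinetic score, and the fixed threshold are all assumptions made for illustration and are not taken from the disclosure.

```python
def assign_reads_to_homologs(reads, locus, threshold):
    """Split read ids into (first_homolog, second_homolog) lists.

    Hypothetical model: each read carries a numeric kinetic score per template
    position; a score at or above `threshold` at `locus` is taken to indicate
    the modified base (first homolog), a lower score its absence (second
    homolog). Reads that do not cover the locus are left unassigned.
    """
    first_homolog, second_homolog = [], []
    for read in reads:
        score = read["kinetics"].get(locus)
        if score is None:
            continue  # read does not span the locus of interest
        if score >= threshold:
            first_homolog.append(read["id"])
        else:
            second_homolog.append(read["id"])
    return first_homolog, second_homolog


reads = [
    {"id": "r1", "kinetics": {100: 3.2}},
    {"id": "r2", "kinetics": {100: 1.0}},
    {"id": "r3", "kinetics": {}},
    {"id": "r4", "kinetics": {100: 2.5}},
]
first, second = assign_reads_to_homologs(reads, locus=100, threshold=2.0)
```

In practice a statistical model of the kinetic signal would likely replace the fixed threshold; the hard cutoff is used here only to keep the grouping step visible.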

In other aspects, the invention provides methods for identifying sequence differences between two nucleic acids from different sources. In some embodiments, such a method comprises: providing a first nucleic acid from a first sample; providing a second nucleic acid from a second sample; treating the first nucleic acid to produce a modified nucleic acid; denaturing the modified nucleic acid and the second nucleic acid; annealing the modified nucleic acid to the second nucleic acid, thereby producing hybrid nucleic acids that comprise a first modified strand from the first sample and a second unmodified strand from the second sample, and further wherein a portion of the hybrid nucleic acids comprise non-complementary regions of one or more base pairs where the first modified strand and the second unmodified strand are non-complementary; binding a non-complementary-region-specific binding agent to the portion of the hybrid nucleic acids, wherein the non-complementary-region-specific binding agent comprises a selectable tag; capturing the portion of the hybrid nucleic acids bound to the non-complementary-region-specific binding agent using the selectable tag; removing nucleic acids not bound to the non-complementary-region-specific binding agent; subjecting the portion of the hybrid nucleic acids to a single type of sequencing reaction in which both the first and second strands of the hybrid nucleic acids are sequenced and both sequence data and modification data are provided in a single sequence read; for each sequence read, analyzing the sequencing data to identify said single-stranded regions where the first modified strand and second unmodified strand are non-complementary; and for each sequence read, analyzing the modification data to determine which portion of said sequence read corresponds to the first modified strand and which portion corresponds to the second unmodified strand, thereby identifying sequence differences between the first nucleic acid from the first sample and the second nucleic acid from the second sample. Optionally, the treating introduces one or more modifications into the first nucleic acid that render it distinguishable from the second nucleic acid. The modifications typically, but not necessarily, comprise methylation, demethylation, hydroxymethylation, or glucosylation. In some preferred embodiments, the first sample is a tumor sample and the second sample is a non-tumor sample, and the sequence differences identified correspond to mutations that occurred during development of a tumor. In certain embodiments, the non-complementary-region-specific agent is selected from a single-strand-specific binding agent and a mismatch-repair agent. In optional embodiments, subsequent to the annealing and prior to the subjecting the method comprises incorporating the hybrid nucleic acids into template molecules having at least one adapter that links the first modified strand to the second unmodified strand, and such incorporating preferably occurs prior to the capturing.

In yet further aspects, methods are provided for identifying loci where two nucleic acids from two different sources are non-complementary. Such a method comprises, in certain embodiments, providing a first nucleic acid from a first sample; providing a second nucleic acid from a second sample; denaturing the first nucleic acid and the second nucleic acid; annealing the first nucleic acid to the second nucleic acid, thereby producing hybrid nucleic acids that comprise a first strand from the first sample and a second strand from the second sample, and further wherein a portion of the hybrid nucleic acids comprise non-complementary regions of one or more base pairs where the first strand and the second strand are non-complementary; performing mismatch repair on the hybrid nucleic acids using modified nucleotides comprising a selectable tag, thereby generating repaired hybrid nucleic acids; isolating the repaired hybrid nucleic acids using the selectable tag; subjecting the repaired hybrid nucleic acids to a single type of sequencing reaction in which both the first and second strands of the repaired hybrid nucleic acids are sequenced and both sequence data and modification data are provided in a single sequence read; and for each sequence read, analyzing the sequence data and modification data to determine which portion of said sequence read comprises the modified nucleotides, thereby identifying loci at which the first nucleic acid and second nucleic acid were non-complementary. In some embodiments, the modified nucleotides comprise one or more methylated nucleotides, hydroxymethylated nucleotides, or glucosylated nucleotides. In certain embodiments, the first sample is a tumor sample and the second sample is a non-tumor sample, and further wherein the loci at which the first nucleic acid and second nucleic acid were non-complementary correspond to mutations that occurred during development of a tumor. In optional embodiments, subsequent to the annealing and prior to the subjecting the method comprises incorporating the hybrid nucleic acids into template molecules having at least one adapter that links the first strand to the second strand, and such incorporating preferably occurs prior to the capturing.

In other aspects, the invention provides methods of identifying fetal sequence reads that comprise providing a mixture of maternal nucleic acids and fetal nucleic acids; sequencing individual nucleic acids from the mixture, thereby generating a set of sequence reads comprising sequence reads from the maternal nucleic acids and sequence reads from the fetal nucleic acids, wherein each of the sequence reads have both base sequence data and modification data; and analyzing the modification data to identify which of the set of sequence reads are fetal sequence reads.

Similarly, the invention provides methods of identifying tumor-derived sequence reads that comprise providing a mixture of non-tumor-derived nucleic acids and tumor-derived nucleic acids; sequencing individual nucleic acids from the mixture, thereby generating a set of sequence reads comprising sequence reads from the non-tumor-derived nucleic acids and sequence reads from the tumor-derived nucleic acids, wherein the sequence reads have both base sequence data and modification data; and analyzing the modification data to identify which of the set of sequence reads are tumor-derived sequence reads.

The invention also provides, in certain aspects, methods of identifying aberrant cells in a biological sample that comprise providing a biological sample comprising a mixture of cells; isolating nucleic acids from the mixture of cells; individually sequencing said nucleic acids from the mixture, thereby generating a set of sequence reads comprising both base sequence data and modification data; and analyzing the modification data to identify which of the set of sequence reads are native to the biological sample and which of the set of sequence reads are from aberrant cells in the biological sample. Optionally, the biological sample is a blood sample, sputum sample, urine sample, nasopharyngeal sample, vaginal sample, biopsy, buccal sample, or colonic sample. In preferred embodiments, the aberrant cells are one or more of tumor cells, stem cells, pluripotent cells, bacterial cells, fungal cells, embryonic cells, or cells from a parasitic organism.

In certain aspects, methods are provided for identifying a pluripotent cell line that include providing a differentiated cell line; contacting the differentiated cell line with at least one reprogramming agent that contributes to reprogramming of said cell to a pluripotent state; maintaining said cell line under conditions appropriate for proliferation of said cell line and for activity of said at least one reprogramming agent for a period of time sufficient to begin reprogramming of said cell line; and periodically sequencing nucleic acids isolated from said cell line to provide both base sequence data and modification data in single sequence reads, wherein the prevalence of non-CpG methylation in the cell line is indicative that the differentiated cell line has been reprogrammed into a pluripotent cell line.

In other aspects, methods are provided for identifying pseudogene sequence reads that include providing a mixture of nucleic acids from a genome comprising both genes and pseudogenes; sequencing individual nucleic acids from the mixture, thereby generating a set of sequence reads comprising sequence reads from the genes and sequence reads from the pseudogenes, wherein each of the sequence reads have both base sequence data and methylation data; and analyzing the methylation data to identify which of the set of sequence reads are pseudogene sequence reads by virtue of the different methylation patterns in genes and pseudogenes.

In yet further aspects, methods are provided for diagnosing an environmental exposure that involve performing single-molecule sequencing on nucleic acids isolated from an individual, where the sequencing provides both genetic data and epigenetic data; and based on the genetic data and epigenetic data, diagnosing whether the individual experienced the environmental exposure. In some embodiments, the epigenetic data is indicative of (1) activation or inactivation of a gene known to be impacted by the environmental exposure, or (2) activation of a metabolic pathway specific for response to the environmental exposure. Optionally, the environmental exposure is one or more of radiation exposure, toxin exposure, pathogen exposure, and malnutrition. In some embodiments, the environmental exposure is indicated by an increase or decrease in methylation of certain genomic regions, an increase or decrease of histone binding to certain genomic regions, and/or an increase or decrease of transcription factor binding to certain genomic regions. In certain preferred embodiments, the nucleic acids are RNA molecules and a change in RNA expression, RNA splicing, RNA base modifications, and/or RNA secondary structure is indicative of the environmental exposure.

In certain aspects, methods for identifying a strain of microorganism are provided. Such methods preferably comprise sequencing epigenetic markers isolated from the microorganism to determine an epigenetic genotype for the microorganism, and identifying the strain of microorganism based on the epigenetic genotype, wherein the sequencing generates a set of sequence reads comprising both base sequence data and modification data. The microorganism can optionally be a virus, bacterium, archaean, protozoan, or fungus, and can be pathogenic or nonpathogenic.

Some aspects of the invention provide methods for determining expression profiles. In preferred embodiments, such a method comprises isolating DNA from an organism; sequencing the DNA to generate sequence reads comprising both sequence data and epigenetic modification data; identifying genes in the DNA based upon the sequence data; and determining which of the genes so identified are being expressed based upon the epigenetic modification data. The DNA is optionally derived from portions of a genome known to contain genes involved in a disease or metabolic pathway of interest, and the disease can be any disease that is affected by epigenetic modifications, e.g., cancer, diabetes, heart disease, or organ rejection. In certain preferred embodiments, the DNA is derived from multiple different tissues.

Methods of identifying mRNA expression are also provided in some aspects of the invention. For example, a method of identifying mRNA expression can comprise growing a cell culture in the presence of modified rNTPs; isolating mRNA molecules from the cell culture; sequencing the mRNA molecules to provide sequence reads comprising both sequence and modification data; and identifying which of the mRNA molecules have incorporated the modified rNTPs, thereby identifying mRNA expression in the cell culture.

Further, methods of distinguishing between a sequence read from a first source and a sequence read from a second source are also provided by certain aspects of the invention. In one such embodiment, the method comprises providing a single reaction mixture comprising both a first nucleic acid from a first source and a second nucleic acid from a second source; simultaneously subjecting the first nucleic acid and the second nucleic acid in the single reaction mixture to a single-molecule real time sequencing reaction that generates sequence reads comprising both base sequence data and modification sequence data; and analyzing the sequence reads, wherein a first subset of the sequence reads is determined to have both sequence data and modification data consistent with the first source, and a second subset of the sequence reads is determined to have both sequence data and modification data consistent with the second source, thereby distinguishing between sequence reads from the first source and sequence reads from the second source.

The invention also provides, in some aspects, methods for differentiating nucleic acids from different sources that comprise treating a first nucleic acid from a first source with a modifying agent to generate a treated nucleic acid; providing a single reaction mixture comprising the treated nucleic acid and a second nucleic acid from a second source; subjecting the single reaction mixture to a single analytical reaction, wherein the single analytical reaction provides both sequence data and modification data for both the treated nucleic acid and the second nucleic acid; analyzing the sequence data to determine a set of sequence reads; and analyzing the modification data to identify which of the set of sequence reads corresponds to the treated nucleic acid and which of the set of sequence reads corresponds to the second nucleic acid, thereby differentiating the first nucleic acid from the second nucleic acid.

In still further aspects, methods are provided for generating a haplotype comprising both sequence data and base modification data. A preferred embodiment of such a method comprises fragmenting a genomic DNA sample to generate fragments at least 2 kb in length, wherein a product of the fragmenting is a fragment comprising a region of interest; subjecting the fragment to a single-molecule sequencing reaction to generate a single sequence read that extends the full length of the fragment comprising the region of interest, wherein the single sequence read comprises base sequence information and kinetic information that is indicative of modified bases within the fragment; and analyzing the base sequence information and the kinetic information in the single sequence read to determine a base sequence for the region of interest and an identification and location of modified bases within the region of interest, thereby generating a haplotype for the region of interest that comprises both sequence data and base modification data. In certain embodiments, a plurality of fragments comprising the region of interest are each subjected to a single-molecule sequencing reaction to generate a plurality of single sequence reads, and the analyzing further comprises generating a plurality of haplotypes for the region of interest, and further constructing a consensus haplotype sequence from the plurality of haplotypes. The genomic DNA sample optionally comprises one or a combination of human DNA, bacterial DNA, viral DNA, fungal DNA, a forensic DNA sample, a patient's DNA sample, a diagnostic DNA sample, a prognostic DNA sample, embryonic DNA, and DNA from a cancer cell, e.g., from a tumor or metastatic cell, e.g., in a blood sample.
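Construction of a consensus haplotype from a plurality of haplotypes can be sketched as a per-position majority vote over both the base call and the modification call. This is an illustrative simplification, not the disclosed procedure: it assumes the per-read haplotypes are already aligned and of equal length, represents each position as a hypothetical `(base, is_modified)` pair, and does not define tie-breaking.

```python
from collections import Counter


def consensus_haplotype(haplotypes):
    """Majority-vote consensus over aligned per-read haplotypes.

    Each haplotype is a list of (base, is_modified) tuples, one per position
    in the region of interest. All haplotypes must have the same length.
    """
    if not haplotypes:
        return []
    length = len(haplotypes[0])
    if any(len(h) != length for h in haplotypes):
        raise ValueError("haplotypes must be aligned to the same length")
    consensus = []
    for column in zip(*haplotypes):
        # Most frequent (base, is_modified) pair observed at this position.
        call, _count = Counter(column).most_common(1)[0]
        consensus.append(call)
    return consensus


per_read_haplotypes = [
    [("A", False), ("C", True), ("G", False)],
    [("A", False), ("C", True), ("G", False)],
    [("A", False), ("C", False), ("T", False)],
]
consensus = consensus_haplotype(per_read_haplotypes)
```

Voting on the base and its modification state together keeps the two data types linked at each position, which is the property that distinguishes such a haplotype from a sequence-only consensus.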

In still further aspects, methods are provided for assembling sequence reads from different nucleic acid fragments to generate a single contig for an entire bacterial chromosome. Preferred embodiments of such methods comprise synchronizing bacterial cells that are actively growing, and titrating an amount of a modified base into the culture medium such that the amount of the modified base incorporated into newly replicated chromosomes is significantly different during different points during replication of the bacterial chromosome, thereby generating newly synthesized bacterial chromosomes having a significantly different amount of the modified base incorporated near the origin than near the ter sequence; fragmenting the newly synthesized bacterial chromosomes to generate chromosomal fragments; subjecting the chromosomal fragments to single-molecule sequencing reactions to generate a set of sequence reads comprising both sequence data and modification data for the chromosomal fragments; analyzing the sequence data to determine a nucleotide sequence for each chromosomal fragment; analyzing the modification data to determine an amount of modified bases in each chromosomal fragment; and based on the sequence data and modification data, assembling the set of sequence reads to generate a single contig for the entire bacterial chromosome.
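The way modification data can inform assembly may be illustrated by ranking fragments according to their modified-base content. The sketch below assumes, purely for illustration, that the modified base was titrated so that incorporation rises steadily from the origin toward the ter sequence, and that each fragment is summarized as a list of per-base booleans; the names `modified_fraction` and `order_by_modification` are hypothetical.

```python
def modified_fraction(mod_flags):
    """Fraction of bases in a fragment that were called as modified."""
    return sum(mod_flags) / len(mod_flags) if mod_flags else 0.0


def order_by_modification(fragments):
    """Return fragment names sorted from lowest to highest modified fraction.

    Under the stated assumption of a rising gradient, this approximates the
    order in which the fragments were replicated (origin first, ter last).
    """
    return sorted(fragments, key=lambda name: modified_fraction(fragments[name]))


fragments = {
    "f_ter": [True, True, True, False],
    "f_ori": [False, False, False, False],
    "f_mid": [True, False, True, False],
}
replication_order = order_by_modification(fragments)
```

Such a ranking gives only a coarse position along the replication time course, so it would supplement rather than replace sequence overlap when ordering reads into a contig.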

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 provides an exemplary illustration of single-molecule, real-time (SMRT®) nucleic acid sequencing.

FIG. 2 provides illustrative examples of various types of reaction data in the context of a pulse trace.

FIG. 3 provides an exemplary system of the invention.

DETAILED DESCRIPTION OF THE INVENTION

I. General

The present invention is generally directed to methods and compositions useful in simultaneous detection of base sequences and modifications (e.g., epigenetic modifications, DNA damage, etc.) within nucleic acid sequences. In particularly preferred aspects, modified nucleotides within sequence templates are detected during nucleic acid sequencing reactions through the use of single molecule nucleic acid analysis such that the resulting sequence read(s) comprises both nucleotide base (“nucleobase”) sequence data and base modification data. In preferred embodiments, kinetic information in the sequence data is indicative of not only the position of a modified base, but also the type of base modification, without the need to compare to other sequence reads from nucleic acids that have been treated to provide base sequence differences that are used to determine the presence of a modified base, e.g., as in conventional bisulfite sequencing. As such, a single read from a single molecule, or a plurality of reads from the single molecule, is sufficient to provide both the base sequence, as well as location and identity of modified bases, for a single template nucleic acid. The ability to detect modifications within nucleic acid sequences is useful for mapping such modifications in various types and/or sets of nucleic acid sequences, e.g., across a set of mRNA transcripts, across a chromosomal region of interest, or across an entire genome. The modifications so mapped can then be related to transcriptional activity, secondary structure of the nucleic acid, siRNA activity, mRNA translation dynamics, kinetics and/or affinities of DNA- and RNA-binding proteins, and other aspects of nucleic acid (e.g., DNA and/or RNA) metabolism.

Further, modifications in genomic sequences can also be indicative of certain biologically important characteristics of the organism from which the genomic sequences were isolated. For example, patterns of modified bases can be used to a) identify a particular bacterial strain or isolate, b) distinguish between different cell types in a sample, e.g., biopsy, or c) inform on the health status of an organism. These examples are not meant to be limiting, and these and further examples are described in greater detail in the descriptions that follow.

In preferred embodiments, detection of modifications within nucleic acid molecules (“templates”) does not require amplification or copying of the molecules prior to detection. In fact, in some cases amplification of nucleic acid molecules yields products that lack the modification to be detected. In further preferred embodiments, detection of modifications within nucleic acid molecules as described herein does not require further modification to the nucleic acid molecules prior to detection. In particular, treatment with bisulfite is not typically required for detection of methylated bases. However, certain embodiments described herein do comprise amplification and/or nucleic acid modifying agents, so their use is not excluded from all embodiments of the methods of the instant invention.

Although certain embodiments of the invention are described in terms of detection of modified nucleotides or other modifications in a single-stranded DNA molecule (e.g., a single-stranded template DNA), various aspects of the invention are applicable to many different types of nucleic acids, including e.g., single- and double-stranded nucleic acids that may comprise DNA (e.g., genomic DNA, mitochondrial DNA, viral DNA, etc.), RNA (e.g., mRNA, siRNA, microRNA, rRNA, tRNA, snRNA, ribozymes, etc.), RNA-DNA hybrids, PNA, LNA, morpholino, and other RNA and/or DNA hybrids, analogs, mimetics, and derivatives thereof, and combinations of any of the foregoing. Nucleic acids for use with the methods, compositions, and systems provided herein may consist entirely of native nucleotides, or may comprise non-natural or non-cognate bases/nucleotides (e.g., synthetic and/or engineered) that may be paired with native nucleotides or may be paired with the same or a different non-natural or non-cognate base/nucleotide. In certain preferred embodiments, the nucleic acid comprises a combination of single-stranded and double-stranded regions, e.g., such as the templates described in U.S. Ser. Nos. 12/383,855 and 12/413,258, both filed on Mar. 27, 2009 and incorporated herein by reference in their entireties for all purposes. In particular, mRNA modifications are difficult to detect by technologies that require reverse transcriptase PCR amplification because such treatment does not maintain the modification in the amplicons. The present invention utilizes methods for analyzing modifications in RNA molecules that do not require such amplification. More generally, in certain embodiments, methods are provided that do not require amplification of a modification-containing nucleic acid.

Generally speaking, the methods of the invention involve monitoring of an analytical reaction to collect “reaction data,” wherein the reaction data is indicative of the progress of the reaction. Reaction data includes data collected directly from the reaction, as well as the results of various manipulations of that directly collected data, any or a combination of which can serve as a signal for the presence of a modification in the template nucleic acid. Reaction data gathered during a reaction is analyzed to identify characteristics indicative of the presence of a modification, and typically such data comprises changes or perturbations relative to data generated in the absence of the modification. For example, certain types of reaction data are collected in real time during the course of the reaction, such as metrics related to reaction kinetics, affinity, rate, processivity, signal characteristics, error types and rates thereof, and the like. As used herein, “kinetics,” “kinetic signature,” “kinetic response,” “activity,” and “behavior” of a reaction component (e.g., an enzyme, binding agent, etc.) or the reaction as a whole generally refer to reaction data related to the progression of the reaction under investigation and are often used interchangeably herein. Signal characteristics vary depending on the type of analytical reaction being monitored. Signal characteristics can refer to detection of luminescent or fluorescent emissions, changes in current or resistance, fluctuations in pH in the reaction mixture, and the like. For example, some reactions use detectable labels to tag one or more reaction components, and signal characteristics for a detectable label include, but are not limited to, the type of signal (e.g., wavelength, charge, etc.) and the shape of the signal (e.g., height, width, curve, etc.).
Other reactions use the magnitude of a change in current flowing through a biological and/or solid-state nano-scale hole (“nanopore”) to detect passage of molecules (e.g., nucleotide bases, singly or within a polynucleotide molecule) through the hole. Different analytes cause different disruptions in current through the nanopore, and the characteristics of these disruptions can be used to detect passage of a particular analyte. Further, signal characteristics for multiple signals (e.g., temporally adjacent signals) can also be used, including, e.g., the distance between signals during a reaction, the number and/or kinetics of extra signals (e.g., that do not correspond to the progress of the reaction, such as cognate or non-cognate sampling during sequencing-by-synthesis), internal complementarity, and the local signal context (i.e., one or more signals that precede and/or follow a given signal). For example, template-directed sequencing reactions often combine signal data from multiple nucleotide incorporation events to generate a sequence read for a nascent strand synthesized, and this sequence read is used to derive, e.g., by complementarity, the sequence of the template strand. Other types of reaction data are generated from statistical analysis of real time reaction data, including, e.g., accuracy, precision, conformance, error rates, etc. In some embodiments, data from a source other than the reaction being monitored is also used. For example, a sequence read generated during a nucleic acid sequencing reaction can be compared to sequence reads generated in replicate experiments, or to known or derived reference sequences from the same or a related biological source. Alternatively or additionally, a portion of a template nucleic acid preparation can be amplified using unmodified nucleotides and subsequently sequenced to provide an experimental reference sequence to be compared to the sequence of the original template in the absence of amplification.
Although certain specific embodiments of the use of particular types of reaction data to detect certain kinds of modifications are described at length herein, it is to be understood that the methods, compositions, and systems are not limited to these specific embodiments. Different types of reaction data can be combined to detect various kinds of modifications, and in certain embodiments more than one type of modification can be detected and identified during a single reaction on a single template. For example, some sequencing reactions involve use of an enzyme to control passage of a nucleic acid through a nanopore, and in such cases reaction data can include both kinetics and other behavior of the enzyme and fluctuations in current through the nanopore. For example, ratchet proteins, helicases, or motor proteins can be used to push or pull a nucleic acid molecule through a hole in a biological or synthetic membrane. The kinetics of these proteins can vary depending on the sequence context of a nucleic acid on which they are acting. For example, they may slow down or pause at a modified base, and this behavior, captured as a part of the reaction data, is indicative of the presence of the modified base even where the modified base is not within the sensing portion of the nanopore. Such variations to the detailed embodiments of the invention will be clear to one of ordinary skill based upon the teachings provided herein.

In certain embodiments, redundant sequence information is generated and analyzed to detect one or more modifications in a template nucleic acid. Redundancy can be achieved in various ways, including carrying out multiple sequencing reactions using the same original template, e.g., in an array format, e.g., a ZMW array. In some embodiments in which a lesion is unlikely to occur in all the copies of a given template, reaction data (e.g., sequence reads, kinetics, signal characteristics, signal context, and/or results from further statistical analyses) generated for the multiple reactions can be combined and subjected to statistical analysis to determine a consensus sequence for the template. In this way, the reaction data from a region in a first copy of the template can be supplemented and/or corrected with reaction data from the same region in a second copy of the template. Similarly, a template can be amplified (e.g., via rolling circle amplification) to generate a concatemer comprising multiple copies of the template, and the concatemer can be subjected to sequencing, thereby generating a sequencing read that is internally redundant. As such, the sequence data from a first segment of the concatemer (corresponding to a first region of the template) can be supplemented and/or corrected with sequence data from a second segment of the concatemer also corresponding to the first region of the template. Alternatively or additionally, a template can be subjected to repeated sequencing reactions to generate redundant sequence information that can be analyzed to more thoroughly characterize the modification(s) present in the template. In certain embodiments, further benefits are realized by comparing sequence generated from nucleic acids having modifications to the same nucleic acids lacking the modifications.
For example, where there is ambiguity in the sequence of a nucleic acid at a given locus, modification of that locus and resequencing of the nucleic acid will provide a further “call” at that locus that will supplement the data provided by the unmodified nucleic acid and allow determination of the true sequence at that locus. Likewise, a template having modifications can be sequenced and subsequently treated to remove the modifications prior to resequencing. Data from the modified and unmodified templates is used together to call the base in question. In both of these strategies, the kinetics of polymerization will change if the modification status of the base changes, and this difference (or lack thereof) is a key aspect of the additional data used to finally determine the sequence at a given position. Of course, these methods are not limited to using a single template molecule in both sequencing reactions, and aliquots of a given sample nucleic acid can be sequenced in parallel, one treated and one untreated, to attain the same type of data sets. Additional information on templates and methods for redundant sequencing of template molecules is provided in U.S. Pat. Nos. 7,476,503, 7,906,284, and 7,901,889; U.S. Patent Publication Nos. 20080161194, 20080161195, 20090280538, 20090298075, 20110212436, and 20110281768; and U.S. patent application Ser. No. 13/403,789, filed Feb. 23, 2012, all of which are incorporated herein in their entireties for all purposes.
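The comparison of kinetics between a modified template and its unmodified counterpart can be illustrated with a simple per-position ratio of mean interpulse durations. The sketch below is illustrative only: the dictionary layout (template position mapped to a list of observed interpulse durations in seconds), the function names, and the ratio threshold of 2.0 are all assumptions made for this example, and a real analysis would use a proper statistical test instead of a fixed cutoff.

```python
from statistics import mean


def ipd_ratios(native_ipds, control_ipds):
    """Per-position ratio of mean interpulse duration, native vs. control.

    Both arguments map a template position to a list of observed interpulse
    durations (seconds). Only positions present in both data sets are scored.
    """
    ratios = {}
    for pos in sorted(set(native_ipds) & set(control_ipds)):
        control_mean = mean(control_ipds[pos])
        if control_mean > 0:
            ratios[pos] = mean(native_ipds[pos]) / control_mean
    return ratios


def flag_positions(ratios, threshold=2.0):
    """Positions whose kinetics differ by at least `threshold`-fold (hypothetical cutoff)."""
    return [pos for pos, ratio in sorted(ratios.items()) if ratio >= threshold]
```

Here the "control" data could come from the treated aliquot, the resequenced template after removal of modifications, or an amplified copy made with unmodified nucleotides.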

The term “modification” as used herein is intended to refer not only to a chemical modification of a nucleic acid, but also to a variation in nucleic acid conformation or composition, interaction of an agent with a nucleic acid (e.g., bound to the nucleic acid), and other perturbations associated with the nucleic acid. As such, a location or position of a modification is a locus (e.g., a single nucleotide or multiple contiguous or noncontiguous nucleotides) at which such modification occurs within the nucleic acid. Of particular interest are modifications that have regulatory functions within a genome, e.g., epigenetic modifications. For example, hypermethylation is an epigenetic modification known to reduce or prevent gene expression. For a double-stranded template, a modification may occur in the strand complementary to a nascent strand synthesized by a polymerase processing the template, or may occur in the displaced strand. In some embodiments, a modification can even occur in a base incorporated into a nascent strand during a template-directed sequencing reaction. Although certain specific embodiments of the invention are described in terms of 5-methylcytosine detection, detection of other types of modified nucleotides (e.g., N⁶-methyladenosine, N³-methyladenosine, N⁷-methylguanosine, 5-hydroxymethylcytosine, other methylated nucleotides, pseudouridine, thiouridine, isoguanosine, isocytosine, dihydrouridine, queuosine, wyosine, inosine, triazole, diaminopurine, β-D-glucopyranosyloxymethyluracil (a.k.a., β-D-glucosyl-HOMedU, β-glucosyl-hydroxymethyluracil, “dJ,” or “base J”), 8-oxoguanosine (a.k.a., “8-oxoG,” 8-oxo-7,8-dihydroguanine, 8-oxoguanine, 7,8-dihydro-8-oxoguanine, and 8-hydroxyguanine), and 2′-O-methyl derivatives of adenosine, cytidine, guanosine, and uridine) is also contemplated. Further, although described primarily in terms of DNA templates, such modified bases can be modified RNA bases and can be detected in RNA (or primarily RNA) templates.
These and other modifications are known to those of ordinary skill in the art and are further described, e.g., in Narayan P, et al. (1987) Mol Cell Biol 7(4):1572-5; Horowitz S, et al. (1984) Proc Natl Acad Sci U.S.A. 81(18):5667-71; “RNA's Outfits: The nucleic acid has dozens of chemical costumes,” (2009) C&EN 87(36):65-68; Kriaucionis, et al. (2009) Science 324 (5929): 929-30; Tahiliani, et al. (2009) Science 324 (5929): 930-35; Matray, et al. (1999) Nature 399(6737):704-8; Ooi, et al. (2008) Cell 133: 1145-8; Petersson, et al. (2005) J Am Chem Soc. 127(5):1424-30; Johnson, et al. (2004) 32(6):1937-41; Kimoto, et al. (2007) Nucleic Acids Res. 35(16):5360-9; Ahle, et al. (2005) Nucleic Acids Res 33(10):3176; Krueger, et al., Curr Opinions in Chem Biology 2007, 11(6):588; Krueger, et al. (2009) Chemistry & Biology 16(3):242; McCullough, et al. (1999) Annual Rev of Biochem 68:255; Liu, et al. (2003) Science 302(5646):868-71; Limbach, et al. (1994) Nucl. Acids Res. 22(12):2183-2196; Wyatt, et al. (1953) Biochem. J. 55:774-782; Josse, et al. (1962) J. Biol. Chem. 237:1968-1976; Lariviere, et al. (2004) J. Biol. Chem. 279:34715-34720; and in International Application Publication No. WO/2009/037473, the disclosures of which are incorporated herein by reference in their entireties for all purposes. Modifications further include the presence of non-natural (e.g., non-standard, non-cognate, synthetic, etc.)
base pairs in the template nucleic acid, including but not limited to hydroxypyridone and pyridopurine homo- and hetero-base pairs, pyridine-2,6-dicarboxylate and pyridine metallo-base pairs, pyridine-2,6-dicarboxamide and pyridine metallo-base pairs, metal-mediated pyrimidine base pairs T-Hg(II)-T and C-Ag(I)-C, and metallo-homo-basepairs of 2,6-bis(ethylthiomethyl)pyridine nucleobases Spy, 6-amino-5-nitro-3-(1′-β-D-2′-deoxyribofuranosyl)-2(1H)-pyridone (dZ), 2-amino-8-(1′-β-D-2′-deoxyribofuranosyl)-imidazo[1,2-a]-1,3,5-triazin-4(8H)-one (dP), and alkyne-, enamine-, alcohol-, imidazole-, guanidine-, and pyridyl-substitutions to the purine or pyrimidine base (Wettig, et al. (2003) J Inorg Biochem 94:94-99; Clever, et al. (2005) Angew Chem Int Ed 117:7370-7374; Schlegel, et al. (2009) Org Biomol Chem 7(3):476-82; Zimmerman, et al. (2004) Bioorg Chem 32(1):13-25; Yanagida, et al. (2007) Nucleic Acids Symp Ser (Oxf) 51:179-80; Zimmerman (2002) J Am Chem Soc 124(46):13684-5; Buncel, et al. (1985) Inorg Biochem 25:61-73; Ono, et al. (2004) Angew Chem 43:4300-4302; Lee, et al. (1993) Biochem Cell Biol 71:162-168; Loakes, et al. (2009), Chem Commun 4619-4631; Yang, et al. (2007) Nucleic Acids Res. 35(13):4238-4249; Yang, et al. (2006) Nucleic Acids Res. 34(21):6095-6101; Geyer, et al. (2003) Structure 11: 1485-1498; and Seo, et al. (2009) J Am Chem Soc 131:3246-3252, all incorporated herein by reference in their entireties for all purposes).
Other types of modifications include, e.g., a nick, a missing base (e.g., apurinic or apyrimidinic sites), a ribonucleoside (or modified ribonucleoside) within a deoxyribonucleoside-based nucleic acid, a deoxyribonucleoside (or modified deoxyribonucleoside) within a ribonucleoside-based nucleic acid, dUTP, rTTP, a pyrimidine dimer (e.g., thymine dimer or cyclobutane pyrimidine dimer), a cis-platin crosslinking, oxidation damage, hydrolysis damage, other methylated bases, bulky DNA or RNA base adducts, photochemistry reaction products, interstrand crosslinking products, mismatched bases, and other types of “damage” to the nucleic acid. As such, certain embodiments described herein refer to “damage” and such damage is also considered a modification of the nucleic acid in accordance with the present invention. Modified nucleotides can be caused by exposure of the DNA to radiation (e.g., UV), carcinogenic chemicals, crosslinking agents (e.g., formaldehyde), certain enzymes (e.g., nickases, glycosylases, exonucleases, methylases/methyltransferases (e.g., m4C methyltransferase, DAM, etc.), other nucleases, glucosyltransferases, etc.), viruses, toxins and other chemicals, thermal disruptions, and the like. Modified nucleotides can be found in native nucleic acids, can be added by treatment in vitro, or cells can be engineered to introduce such modifications in vivo, e.g., by transformation with an appropriate modification-introducing agent, such as the enzymes listed above.

A preferred enzyme for use with certain methods described herein is a non-specific adenine methyltransferase described in U.S. Provisional Patent Application No. 61/599,230, filed Feb. 15, 2012, and incorporated herein by reference in its entirety for all purposes. Various methods can be used to identify such a methyltransferase, such as those described in Fang, et al. (2012) Nature Biotechnology 30(12):1232-1239, which is incorporated herein by reference in its entirety for all purposes. Methyltransferases having specific recognition sequences or “motifs” have been used for chromatin mapping. A chromatin sample is treated with such an enzyme, which introduces methyl groups on exposed or accessible portions of the chromatin. Subsequent detection of methyl groups provides data on which regions of the nucleic acid within the chromatin were accessible to the methyltransferase, but only where those portions comprise a sequence motif that is recognized by the particular methyltransferase being used. As such, exposed portions lacking the sequence motif would not be detectable by this method. In contrast, by using a nonspecific methyltransferase rather than a sequence-specific methyltransferase, the “methyl labeling” of the nucleic acid component of the chromatin would not be dependent upon a recognition sequence in the exposed nucleic acids, thereby allowing detection of accessible regions without a particular recognition sequence, as long as a “methylatable nucleoside” (e.g., C or A) is present. In certain preferred embodiments, the enzyme is a 6-mA methyltransferase, in other embodiments it is a 4-mC methyltransferase, in further embodiments it is a phosphorothioate modification enzyme, and in still further embodiments a combination of such modification enzymes is used to identify accessible nucleic acid regions within a chromatin molecule.
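To illustrate how detected methyl labels might be turned into a map of accessible chromatin, the sketch below merges nearby methylated positions into intervals. It is a simplified example under stated assumptions: the input is a list of template coordinates at which a methylated base was detected, the function name `accessible_regions` is hypothetical, and the `max_gap` of 50 bases is an arbitrary illustrative value, not a parameter taken from the specification.

```python
def accessible_regions(methylated_positions, max_gap=50):
    """Merge methylated positions into (start, end) accessible intervals.

    Two consecutive methylated positions are placed in the same interval when
    they lie no more than `max_gap` bases apart (hypothetical cutoff).
    """
    regions = []
    for pos in sorted(methylated_positions):
        if regions and pos - regions[-1][1] <= max_gap:
            # Close enough to the current interval: extend it.
            regions[-1][1] = pos
        else:
            # Too far from the previous label: start a new interval.
            regions.append([pos, pos])
    return [tuple(region) for region in regions]
```

With a non-specific methyltransferase, gaps between intervals more plausibly reflect protected regions; with a motif-specific enzyme they could simply reflect absence of the motif.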

In vivo, DNA damage is a major source of mutations leading to various diseases including cancer, cardiovascular disease, and nervous system diseases (see, e.g., Lindahl, T. (1993) Nature 362(6422): 709-15, which is incorporated herein by reference in its entirety for all purposes). The methods and systems provided herein can also be used to detect various conformations of DNA, in particular, secondary structure forms such as hairpin loops, stem-loops, internal loops, bulges, pseudoknots, base-triples, supercoiling, internal hybridization, and the like; and are also useful for detection of agents interacting with the nucleic acid, e.g., bound proteins such as chromatin components (e.g., histones, RNAs, etc.), transcription factors, regulatory proteins or RNAs, replication factors, or other moieties.

Further, nucleic acids known to have or suspected of having one or more modifications of interest can be targeted, e.g., using antibodies or other binding agents specific to the one or more modifications, and the nucleic acids containing the one or more modifications can be selected or “captured” by various methods known in the art, e.g., immunoprecipitation, column chromatography, bead separations, etc. Once the nucleic acids that do not contain the one or more modifications are removed, e.g., by washing, buffer exchange, etc., the selected nucleic acids can be subjected to template-directed or other sequencing to identify and/or map the one or more modifications. For example, affinity purification can be utilized to isolate nucleic acids comprising a modification of interest. Affinity purification makes use of specific binding interactions between molecules. In certain preferred methods, a particular ligand is chemically immobilized or coupled to a solid support (e.g., column) so that when a complex mixture is passed over the support those molecules having specific binding affinity to the ligand become bound. After other components of the mixture are washed away, the bound molecule is stripped from the support, resulting in its purification from the original complex mixture. It is within the skill of the ordinary artisan to determine a ligand that has a specific binding affinity for a modification of interest, and such ligands can include, without limitation, antibodies, enzymes, nucleic acids, nucleic acid binding agents, fusion proteins, and the like. Further, prior to affinity purification, the modification of interest may be subjected to a treatment that adds a tag to facilitate purification, e.g., an antigen, binding partner, protein tag, etc. Affinity purification methods, tags, and supports are well known, widely used, and commercially available for use by those of ordinary skill in the art.

In certain aspects, the described methods and compositions involve detection of modifications within a single template molecule during sequencing, as well as determination of their location (i.e., “mapping”) within the template molecule. In certain preferred embodiments, high-throughput, real-time, single-molecule, template-directed sequencing assays are used to detect the presence of such modified sites and to determine their location on the DNA template, e.g., by monitoring the progress and/or kinetics of a polymerase enzyme processing the template. For example, when a polymerase enzyme encounters certain types of damage or other modifications in a DNA template, the progress of the polymerase can be temporarily or permanently blocked, e.g., resulting in a paused or dissociated polymerase. As such, the detection of a pause in or termination of nascent strand synthesis is indicative of the presence of such damage or lesion. Similarly, certain types of modifications cause other perturbations in the activity of the polymerase, such as changes in the kinetics of nascent strand synthesis, e.g., changes in pulse width or interpulse duration. Yet further, some modifications cause changes in the enzyme activity that are detectable as changes in the error metrics of the enzyme during template-directed polymerization. By analysis of the sequence reads produced prior to the change or perturbation in activity of the polymerase, and alternatively or additionally after reinitiation of synthesis, one can map the site of the damage or lesion on the template. Since different types of lesions can have different effects on the progress of the polymerase on the substrate, in certain cases the behavior of the polymerase on the template not only informs as to where the lesion occurs, but also what type of lesion is present. Preferred sequencing methods for detection of modified bases during real-time sequencing are described at length in U.S. Patent Publication No.
20110183320; PCT Patent Application No. PCT/US2011/060338 (WO 2012065043); Clark, et al. (2013) BMC Biology 11:4 doi:10.1186/1741-7007-11-4; Lluch-Senar, et al. (2013) PLoS Genet 9(1): e1003191. doi:10.1371/journal.pgen.1003191; Fang, et al. (2012) Nature Biotechnology 30:1232-1239; Ynull, et al. (2012) Genome Biology 13:175; Schadt, et al. (2012) Genome Research doi:10.1101/gr.136739.111; Murray, et al. (2012) Nucl. Acids Res. doi: 10.1093/nar/gks891; Clark, et al. (2011) Genome Integrity 2:10 doi:10.1186/2041-9414-2-10; Clark, et al. (2011) Nucl. Acids Res. doi: 10.1093/nar/gkr1146; Song, et al. (2011) Nature Methods doi:10.1038/nmeth.1779; and Flusberg, et al. (2010) Nature Methods doi:10.1038/nmeth.1459, the disclosures of which are incorporated herein by reference in their entireties for all purposes.

While certain compositions and methods provided herein are described in terms of sequencing reactions that utilize single-molecule, real-time, polymerase-mediated, template-directed nascent strand synthesis, the present invention is not limited to use of these sequencing reactions and other sequencing technologies capable of providing reaction data indicative of a modification within a template are also contemplated. For example, in certain preferred embodiments, real-time, single-molecule, electrical-detection-based sequencing assays are used to detect the presence of such modified sites and to determine their location on the DNA template, e.g., by monitoring the progress and/or kinetics of the passage of a nucleic acid template through a constriction through which a current flows. The presence of the template within the passage (e.g., nanopore or other nano-scale hole, e.g., pore protein or synthetically-engineered nanohole) perturbs the current flow in a sequence-specific manner to provide a signal for identifying which nucleotide base is in a “detection region” within the hole. The speed at which the template passes through the hole depends on various factors, including the force used to push or pull the template through the hole, the relative size of the hole, and the sequence of bases within the template strand. For example, a base comprising a bulky modification, e.g., methyl groups, sugar moieties, and the like, is expected to perturb the current to a greater degree than a non-modified base, providing a relatively greater signal than that of the non-modified base, based at least on steric hindrance caused by the larger size of the modified base. Yet further, the mechanism by which the template is pushed or pulled through the hole is clearly a major factor in the kinetics of passage, and that mechanism can be affected by the presence of modified bases.
For example, in certain embodiments a helicase is used to unwind a double-stranded nucleic acid and send one of the two strands into the hole to be detected. Where modifications present in the duplex molecule inhibit unwinding by the helicase, the kinetics of passage will slow and a longer transition from one base to the next will be detectable within the nanohole. Likewise, and as described for the sequencing-by-synthesis reactions above, the kinetics of a polymerase enzyme that is processing a template is affected by base modifications. As such, the kinetics of nanohole sequencing reactions that use a secondary “driving” mechanism (e.g., an enzyme (polymerase, helicase, etc.), motor protein, ratchet protein, and the like) to drive a template strand through the nanohole will be impacted by the effect of modifications on the driving mechanism or “driver.” Further, since the kinetic changes at the driver are uncoupled from the subsequent detection of the modified bases themselves as they pass through the nanopore, analysis of the kinetics of such a sequencing strategy will need to account for not only the current and kinetic effects within the nanopore, but also factor in kinetic effects at the driver. As such, single-molecule, real-time sequencing technologies include not only polymerase-mediated sequencing-by-synthesis reactions, but also nanopore-based sequencing technologies, and preferred methods are those that are able to detect modifications (e.g., within a template molecule) during a sequencing reaction in which the base identities are also determined. By analysis of the base and modification data within the sequence reads, one can map the site of the damage or lesion on the template. These and other aspects of the invention are described in greater detail in the description and examples that follow.
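The need to account separately for kinetic effects at the driver and current effects within the nanopore can be illustrated with a small sketch that realigns the two signals before combining them. Everything specific in this example is an assumption: that at translocation step t the base in the sensing region is base t while the base at the driver is base t + `offset` (i.e., a base reaches the driver before it reaches the pore), that both signals have already been converted to z-scores against an unmodified reference, and the cutoff of 3.0. The function name `flag_modified` is likewise hypothetical.

```python
def flag_modified(current_z, dwell_z, offset, cut=3.0):
    """Flag bases with evidence of modification at the pore or at the driver.

    current_z[i]: z-score of the current while base i is in the sensing region.
    dwell_z[t]:   z-score of the dwell time at translocation step t, when
                  base t is in the sensing region and base t + offset is
                  at the driver (assumed geometry).
    """
    flagged = []
    for base in range(len(current_z)):
        # The step at which this base sat at the driver.
        step = base - offset
        driver_hit = 0 <= step < len(dwell_z) and dwell_z[step] >= cut
        pore_hit = abs(current_z[base]) >= cut
        if driver_hit or pore_hit:
            flagged.append(base)
    return flagged
```

A stricter variant could require both signals to agree after realignment, trading sensitivity for specificity.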

II. Single Molecule Sequencing

In certain aspects of the invention, single molecule real time sequencing systems are applied to the detection of modified nucleic acid templates through analysis of the reaction data (e.g., sequence and/or kinetic data) derived from such systems. In particular, modifications in a template nucleic acid strand can cause unique and identifiable alterations in an analytical reaction that allow the modifications to be identified. For example, in certain embodiments, modifications in a template nucleic acid strand alter the enzymatic activity of a nucleic acid polymerase in various ways, e.g., by increasing the time for a bound nucleobase to be incorporated and/or increasing the time between incorporation events. The alteration in enzymatic activity can optionally be detected using a nucleic acid sequencing technology that detects incorporation of nucleotides (e.g., comprising a detectable label) into a nascent strand in real time. Such methods are described in detail in International Patent Application No. PCT/US2011/060338, filed Nov. 11, 2011, and incorporated herein by reference in its entirety for all purposes. In other embodiments, modifications in a template change the way in which the current through a nanopore is perturbed during passage of the template. In preferred embodiments, such modifications are detected using a single molecule nucleic acid sequencing technology, where a sequence read generated corresponds to a single molecule of a nucleic acid template. In preferred embodiments, a single molecule nucleic acid sequencing technology is capable of real-time detection of individual nucleotides, e.g., during nucleotide incorporation or passage through a nanohole. Such sequencing technologies are known in the art and include, e.g., the SMRT® sequencing and nanopore sequencing technologies. For more information on nanopore sequencing, see, e.g., U.S. Pat. No. 5,795,782; Kasianowicz, et al. (1996) Proc Natl Acad Sci USA 93(24):13770-3; Ashkenas, et al.
(2005) Angew Chem Int Ed Engl44(9):1401-4; Howorka, et al. (2001) Nat Biotechnology 19(7):636-9;Astier, et al. (2006) J Am Chem Soc 128(5):1705-10; U.S. Ser. No.13/083,320, filed Apr. 8, 2011; and Zhao, et al. (2007) Nano Letters7(6):1680-1685, all of which are incorporated herein by reference intheir entireties for all purposes.

With regard to nucleic acid sequencing, the term “template” refers to a nucleic acid molecule subjected to a sequencing reaction. For example, in a sequencing-by-synthesis reaction a template is the molecule used by a polymerase to direct synthesis of the nascent strand; e.g., it is complementary to the nascent strand being produced. In a nanopore-based sequencing method, the template is the nucleic acid passed through the nanopore, whether intact or after nucleolytic degradation. A template may comprise, e.g., DNA, RNA, or analogs, mimetics, derivatives, or combinations thereof, as described elsewhere herein. Further, a template may be single-stranded, double-stranded, or may comprise both single- and double-stranded regions. For sequencing technologies that interrogate a single strand of a double-stranded template, the “template strand” is the strand that is interrogated, e.g., passed through a nanopore, used to synthesize a nascent strand, or hybridized to sequence-specific probes (as in SOLiD® sequencing). A modification in a double-stranded template may be in the strand subjected to interrogation, or in the complementary strand. For example, the modification may be present in a strand used by a polymerase to generate a complementary strand, or may be in a complementary strand that is displaced by the polymerase. Yet further, the modification may be in a strand passed through a nanopore, or in a complementary strand that is displaced from the strand passed through the nanopore, such as where the modification affects the kinetics of displacement, e.g., by a ratchet protein, helicase, polymerase, etc. In such a case, the modification can be detected without having been passed through the hole. Simply put, a modification being detected may occur in a template strand or a complement thereof.

Certain direct modification sequencing methods can be performed using single-molecule real-time sequencing systems, e.g., that illuminate and observe individual reaction complexes continuously over time, such as those developed for SMRT® sequencing (see, e.g., P. M. Lundquist, et al., Optics Letters 2008, 33, 1026, which is incorporated herein by reference in its entirety for all purposes). The foregoing SMRT® sequencing instrument generally detects fluorescence signals from an array of thousands of ZMWs simultaneously, resulting in highly parallel operation. Each ZMW, separated from others by distances of a few micrometers, represents an isolated sequencing chamber for a single polymerase acting upon a single template nucleic acid molecule. Other single-molecule, real-time sequencing methods and systems are also applicable to the methods herein, including without limitation those described in U.S. Ser. No. 13/083,320, filed Apr. 8, 2011; and U.S. Patent Publication No. 20120014832, the disclosures of which are incorporated herein by reference in their entireties for all purposes.

Detection of single molecules or molecular complexes in real time, e.g., during the course of an analytical reaction, generally involves direct or indirect disposal of the analytical reaction such that each molecule or molecular complex to be detected is individually resolvable. In this way, each analytical reaction can be monitored individually, even where multiple such reactions are immobilized on a single support, e.g., substrate, membrane, or other surface. Individually resolvable configurations of analytical reactions can be accomplished through a number of mechanisms, and typically involve immobilization of at least one component of a reaction at a reaction site, e.g., a ZMW or nanopore (which can be considered both a component of the reaction and the reaction site itself). Various methods of providing individually resolvable configurations are known in the art, e.g., see European Patent No. 1105529 to Balasubramanian, et al.; and Published International Patent Application No. WO 2007/041394, the full disclosures of which are incorporated herein by reference in their entireties for all purposes. A reaction site on a substrate is generally a location on the substrate at which a single analytical reaction (e.g., comprising a single template molecule) is performed and monitored, preferably in real time. A reaction site may be on a planar surface of the substrate, or may be in an aperture in the surface of the substrate, e.g., a well, nanohole, or other aperture. In preferred embodiments, such apertures are “nanoholes,” which are nanometer-scale holes or wells that provide structural confinement of analytic materials of interest within a nanometer-scale diameter, e.g., ˜1-300 nm. In some embodiments, such apertures comprise optical confinement characteristics, such as zero-mode waveguides, which are also nanometer-scale apertures and are further described elsewhere herein. Typically, the observation volume (i.e., the volume within which detection of the reaction takes place) of such an aperture is at the attoliter (10⁻¹⁸ L) to zeptoliter (10⁻²¹ L) scale, a volume suitable for detection and analysis of single molecules and single molecular complexes.

The immobilization of a component of an analytical reaction can be engineered in various ways. For example, an enzyme (e.g., polymerase, reverse transcriptase, kinase, etc.) may be attached to the substrate at a reaction site, e.g., within an optical confinement or other nanometer-scale aperture. In other embodiments, a substrate in an analytical reaction (for example, a nucleic acid template, e.g., DNA, RNA, or hybrids, analogs, derivatives, and mimetics thereof, or a target molecule for a kinase) may be attached to the substrate at a reaction site. Certain embodiments of template immobilization are provided, e.g., in U.S. patent application Ser. No. 12/562,690, filed Sep. 18, 2009 and incorporated herein by reference in its entirety for all purposes. One skilled in the art will appreciate that there are many ways of immobilizing nucleic acids and proteins into an optical confinement, whether covalently or non-covalently, via a linker moiety, or tethering them to an immobilized moiety. These methods are well known in the field of solid phase synthesis and micro-arrays (Beier et al., Nucleic Acids Res. 27:1970-1977 (1999)). Non-limiting exemplary binding moieties for attaching either nucleic acids or polymerases to a solid support include streptavidin or avidin/biotin linkages, carbamate linkages, ester linkages, amide, thiolester, (N)-functionalized thiourea, functionalized maleimide, amino, disulfide, and hydrazone linkages, among others. Antibodies that specifically bind to one or more reaction components can also be employed as the binding moieties. In addition, a silyl moiety can be used to attach a nucleic acid directly to a substrate such as glass using methods known in the art. In yet further embodiments, a reaction is localized to a given position by virtue of the presence of a hole (e.g., nanopore) through which a template molecule passes during an analytical reaction.

In some embodiments, a nucleic acid template is immobilized onto a reaction site (e.g., within an optical confinement) by attaching a primer comprising a complementary region at the reaction site that is capable of hybridizing with the template, thereby immobilizing it in a position suitable for monitoring. In certain embodiments, an enzyme complex is assembled in an optical confinement, e.g., by first immobilizing an enzyme component. In other embodiments, an enzyme complex is assembled in solution prior to immobilization. Where desired, an enzyme or other protein reaction component to be immobilized may be modified to contain one or more epitopes for which specific antibodies are commercially available. In addition, proteins can be modified to contain heterologous domains such as glutathione S-transferase (GST), maltose-binding protein (MBP), specific binding peptide regions (see, e.g., U.S. Pat. Nos. 5,723,584, 5,874,239 and 5,932,433), or the Fc portion of an immunoglobulin. The respective binding agents for these domains, namely glutathione, maltose, and antibodies directed to the Fc portion of an immunoglobulin, are available and can be used to coat the surface of an optical confinement of the present invention. The binding moieties or agents of the reaction components they immobilize can be applied to a support by conventional chemical techniques which are well known in the art. In general, these procedures can involve standard chemical surface modifications of a support, incubation of the support at different temperature levels in different media comprising the binding moieties or agents, and possible subsequent steps of washing and cleaning. In yet further embodiments, a template can be immobilized at a reaction site by binding to an enzyme or other protein (e.g., polymerase, motor protein, helicase, etc.) that serves only to deliver the template to a sequence detector, e.g., a nanopore sensor.

In some embodiments, a substrate comprising an array of reaction sites is used to monitor multiple biological reactions, each taking place at a single one of the reaction sites. Various means of loading multiple biological reactions onto an arrayed substrate are known to those of ordinary skill in the art and are described further, e.g., in U.S. Ser. No. 61/072,641, incorporated herein by reference in its entirety for all purposes. For example, basic approaches include: creating a single binding site for a reaction component at the reaction site; removing excess binding sites at the reaction site via catalytic or secondary binding methods; adjusting the size or charge of the reaction component to be immobilized; packaging or binding the reaction component within (or on) a particle (e.g., within a viral capsid), where a single such particle fits into the relevant reaction site (due to size or charge of the particle and/or observation volume); using non-diffusion-limited loading; controllably loading the reaction component (e.g., using microfluidic or optical or electrical control); sizing or selecting charges in the reaction sites/observation volumes (e.g., the sizes of optical confinements in an array) to control which reaction components will fit (spatially or electrostatically) into which reaction sites/observation volumes; iterative loading of reaction components, e.g., by masking active sites between loading cycles; enriching the activity of the reaction components that are loaded; using self-assembling nucleic acids to sterically control loading; adjusting the size of the reaction site/observation volume; and many others. Such methods and compositions provide for the possibility of completely loading single-molecule array reaction sites (instead of about 30% of such sites as occurs in “Poisson limited” loading methods) with single reaction components (e.g., molecular complexes).
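The “Poisson limited” figure mentioned above can be illustrated with a short calculation. The following Python sketch assumes, purely for illustration, that complexes land on reaction sites independently so that the number per site follows a Poisson distribution with mean lam; the function names and the chosen mean values are hypothetical and are not taken from this disclosure.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability that a site holds exactly k complexes at mean load lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def occupancy(lam: float):
    """Return the fractions of sites that are empty, singly and multiply loaded."""
    empty = poisson_pmf(0, lam)
    single = poisson_pmf(1, lam)
    multiple = 1.0 - empty - single
    return empty, single, multiple


# Illustrative mean loads (assumed values, not from the disclosure).
for lam in (0.25, 0.5, 1.0, 2.0):
    empty, single, multiple = occupancy(lam)
    print(f"mean={lam:4.2f}  empty={empty:.3f}  single={single:.3f}  multiple={multiple:.3f}")
```

Under this assumption the singly loaded fraction is lam·exp(−lam), which peaks at 1/e (about 37%) when the mean load is one complex per site and is about 30% at a mean load of 0.5. Raising the load further only increases the multiply loaded fraction, which is why the loading strategies listed above aim to exceed this limit.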

In embodiments utilizing nano-scale holes and current-based detection, such as nanopore methods, it is beneficial to ensure that each detection area on a substrate contains only a single nano-scale aperture. Otherwise, two different templates may be passing through two different holes simultaneously, and the resulting overlapping signals would be difficult if not impossible to resolve. The surface of solid-state nanopores can be treated in order to either improve their sequencing performance or to enable the creation of a hybrid protein/solid-state nanopore. In such a hybrid, the solid-state pore acts as a substrate with a hole for the protein nanopore, which would be positioned as a plug within the hole. The protein nanopore would perform the sensing of DNA molecules. This hybrid can have the advantages of both types of nanopores: the possibility for batch fabrication, stability, compatibility with micro-electronics, and a population of identical sensing subunits. Unlike methods where a lipid layer much larger than the width of a protein nanopore is used, the hybrid nanopores are generally constructed such that the dimensions of the solid state pore are close to the dimensions of the protein nanopore. The solid state pore into which the protein nanopore is disposed is generally from about 20% larger to about three times larger than the diameter of the protein nanopore. In preferred embodiments the solid state pore is sized such that only one protein nanopore will associate with the solid state pore. An array of hybrid nanopores is generally constructed by first producing an array of solid state pores in a substrate, selectively functionalizing the nanopores for attachment of the protein nanopore, then coupling or conjugating the protein nanopore to the walls of the solid state pore using linker/spacer chemistry. Techniques and materials for constructing such hybrid nanopores, as well as ensuring a high proportion of pores having only one protein nanopore per solid state pore, are further described in U.S. Ser. No. 13/083,320, filed Apr. 8, 2011, and incorporated herein by reference in its entirety for all purposes.

In preferred aspects, the methods, compositions, and systems provided herein utilize optical confinements to facilitate single molecule resolution of analytical reactions. In preferred embodiments, such optical confinements are configured to provide tight optical confinement so only a small volume of the reaction mixture is observable. Some such optical confinements and methods of manufacture and use thereof are described at length in, e.g., U.S. Pat. Nos. 7,302,146, 7,476,503, 7,313,308, 7,315,019, 7,170,050, 6,917,726, 7,013,054, 7,181,122, and 7,292,742; U.S. Patent Publication Nos. 20080128627, 20080152281, and 200801552280; and U.S. Ser. Nos. 11/981,740 and 12/560,308, all of which are incorporated herein by reference in their entireties for all purposes.

Where reaction sites are located in optical confinements, the optical confinements can be further tailored in various ways for optimal confinement of an analytical reaction of interest. In particular, the size, shape, and composition of the optical confinement can be specifically designed for containment of a given enzyme complex and for the particular label and illumination scheme used.

In certain preferred embodiments of the invention, single-molecule, real-time sequencing systems already developed are applied to the detection of modified nucleic acid templates through analysis of the sequence and kinetic data derived from such systems. For example, methylated cytosine and other modifications in a template nucleic acid will alter the enzymatic activity of a polymerase processing the template nucleic acid. Other real-time sequencing technologies sensitive to modifications within a template nucleic acid can also be used to detect such modifications, e.g., nanopore sensor sequencing methods. In certain embodiments, polymerase kinetics in addition to sequence read data are detected using a single-molecule nucleic acid sequencing technology, e.g., the SMRT® sequencing technology developed by Pacific Biosciences. This technique is capable of long sequencing reads and can be used to provide high-throughput methylation profiling even in highly repetitive genomic regions, facilitating de novo sequencing of modifications such as methylated bases. SMRT® sequencing systems typically utilize state-of-the-art single-molecule detection instruments, production-line nanofabrication chip manufacturing, organic chemistry, protein mutagenesis, selection and production facilities, and software and data analysis infrastructures. Additional details on single-molecule, real-time microscopy, including sequencing applications, are provided, e.g., in U.S. Pat. Nos. 7,056,661, 7,476,503, 8,143,030, 7,901,889, and 6,917,726; Eid, J. et al. (2009) Science 323, 133; J. Korlach, et al. (2008) Nucleos. Nucleot. Nucleic Acids 27, 1072; Lundquist, et al. (2008) Optics Letters 33(9): 1026; and Levene, et al. (2003) Science 299: 682, all of which are incorporated herein by reference in their entireties for all purposes.

Certain preferred methods of the invention employ real-time sequencing of single DNA molecules (Eid, et al., supra), with intrinsic sequencing rates of multiple bases per second and average read lengths in the 1000 to 10,000 base-pair range. In such sequencing, sequential base additions catalyzed by DNA polymerase into the growing complementary nucleic acid strand are detected with fluorescently labeled nucleotides. The kinetics of base additions and polymerase translocation are sensitive to the structure of the DNA double-helix, which is impacted by the presence of base modifications, e.g., 5-MeC, 5-hmC, base J, etc., and other modifications (secondary structure, bound agents, etc.) in the template. By monitoring the activity of DNA polymerase during sequencing, sequence read information and base modifications can be simultaneously detected. Long, continuous sequence reads that are readily achievable using SMRT® sequencing facilitate modification (e.g., methylation) profiling in low complexity regions that are inaccessible to some technologies, such as certain short-read sequencing technologies. Carried out in a highly parallel manner, epigenomes can be sequenced directly, with single base-pair resolution and high throughput.

In preferred embodiments, optical confinements within which a single sequencing reaction takes place are ZMW nanostructures, preferably in an arrayed format. Typically, ZMW arrays comprise dense arrays of holes, ˜100 nm in diameter, fabricated in a ˜100 nm thick metal film deposited on a transparent substrate (e.g., silicon dioxide). These structures are further described in the art, e.g., in M. J. Levene, et al., Science 2003, 299, 682; and M. Foquet, et al., J. Appl. Phys. 2008, 103, 034301, the disclosures of which are incorporated herein by reference in their entireties for all purposes. Each ZMW becomes a nanophotonic visualization chamber for recording an individual polymerization reaction, providing a detection volume of just 100 zeptoliters (10⁻²¹ liters). This volume represents a 1000-fold improvement over diffraction-limited confocal microscopy, facilitating observation of single incorporation events against the background created by the relatively high concentration of fluorescently labeled nucleotides. Polyphosphonate and silane-based surface coatings mediate enzyme immobilization to the transparent floor of the ZMW while blocking non-specific attachments to the metal top and side wall surfaces (Eid, et al., supra; and J. Korlach, et al., Proc Natl Acad Sci USA 2008, 105, 1176, the disclosures of which are incorporated herein by reference in their entireties for all purposes). While certain methods described herein involve the use of ZMW confinements, it will be readily understood by those of ordinary skill in the art upon review of the teachings herein that these methods may also be practiced using other reaction formats, e.g., on planar substrates or in nanometer-scale apertures other than zero-mode waveguides. (See, e.g., U.S. Ser. No. 12/560,308, filed Sep. 15, 2009; and U.S. Patent Publication No. 20080128627, incorporated herein supra.)
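The benefit of such a small detection volume can be checked with a short calculation of the mean number of freely diffusing labeled nucleotides expected inside it, N = C × V × N_A. In the Python sketch below, the 100 zeptoliter volume and the 1000-fold ratio are taken from the paragraph above; the nucleotide concentrations and the function name mean_molecules are illustrative assumptions only.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def mean_molecules(conc_molar: float, volume_liters: float) -> float:
    """Mean number of molecules in a volume at a given molar concentration."""
    return conc_molar * volume_liters * AVOGADRO


ZMW_VOLUME_L = 100e-21                    # 100 zeptoliters, as stated above
LARGER_VOLUME_L = ZMW_VOLUME_L * 1000     # a volume 1000-fold larger, for comparison

# Assumed, illustrative labeled-nucleotide concentrations (mol/L).
for conc in (1e-8, 1e-7, 1e-6, 1e-5):
    in_zmw = mean_molecules(conc, ZMW_VOLUME_L)
    in_larger = mean_molecules(conc, LARGER_VOLUME_L)
    print(f"{conc:.0e} M: {in_zmw:.4f} molecules in ZMW volume, {in_larger:.1f} in 1000x volume")
```

Under these assumptions, a 100 zeptoliter volume at 1 µM holds about 0.06 diffusing molecules on average, so a nucleotide held by the polymerase stands out clearly, whereas a volume 1000 times larger would hold about 60 such molecules at the same concentration.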

Certain preferred embodiments of a sequencing-by-synthesis reaction to be used with the methods described herein include phospholinked nucleotides comprising a detectable label (e.g., comprising a fluorescent dye) attached to a phosphate group that is removed upon incorporation of the constituent nucleobases into a nascent strand. For example, typically a detectable label is linked to the terminal phosphate, but the label can also be attached to another phosphate group that is not the alpha phosphate, which is incorporated into the sugar-phosphate backbone of the nascent strand. (See, e.g., J. Korlach, et al., Nucleos. Nucleot. Nucleic Acids 2008, 27, 1072, which is incorporated herein by reference in its entirety for all purposes.) Even with 100% replacement of unmodified nucleotides by phospholinked nucleotides, the enzyme cleaves away the label as part of the incorporation process, leaving behind a completely natural, double-stranded nucleic acid product. Each of the four different standard nucleobases (e.g., dATP, dTTP, dCTP, and dGTP, or analogs thereof) is labeled with a distinct detectable label to discriminate base identities during incorporation events, thus enabling sequence determination of the complementary DNA template. During incorporation, the enzyme holds the labeled nucleotide in the ZMW's detection volume for tens of milliseconds, orders of magnitude longer than the average diffusing nucleotide is present. Signal (e.g., fluorescence) is emitted continuously from the detectable label during the duration of incorporation, causing a detectable pulse of increased fluorescence in the corresponding color channel. The pulse is terminated naturally by the polymerase releasing the pyrophosphate-linker-label group. Preferably, the removal of the linker and label during incorporation is complete such that the nucleotide incorporated has no remnants of the linker or label remaining. The polymerase then translocates to the next base, and the process repeats.

The principle of SMRT® sequencing is illustrated in FIG. 1. Briefly, as shown in FIG. 1A, single DNA polymerase molecules with bound DNA template are attached to a substrate, e.g., at the bottom of each zero-mode waveguide. Polymerization of the complementary DNA strand is observed in real time by detecting fluorescently labeled nucleotides. At step 1, the DNA template/primer/polymerase complex is surrounded by diffusing fluorescently labeled nucleotides which probe the active site. A labeled nucleotide makes a cognate binding interaction with the next base in the DNA template that lasts for tens of milliseconds at step 2, during which fluorescence is emitted continuously. At step 3, the polymerase incorporates the nucleotide (nucleobase, alpha phosphate, and sugar group) into the growing nucleic acid chain, thereby cleaving the α-β phosphodiester bond and releasing a labeled polyphosphate. The process repeats for each nucleotide incorporated into the nascent strand at steps 4 and 5, and monitoring the fluorescent signals emitted during the binding and incorporation of a nucleotide into the growing nascent strand provides a sequence of nucleotide incorporations that can be used to derive the sequence of the template nucleic acid. A prophetic trace is shown in FIG. 1B that comprises each step shown in 1A. At steps 2 and 4, a fluorescent signal is emitted during binding and incorporation of a nucleotide into the growing nucleic acid chain, and monitoring of these fluorescent signals provides a sequence of nucleotide incorporations that can be used to derive the sequence of the template nucleic acid. For example, a 5′-G-A-3′ sequence in the growing chain indicates a 5′-T-C-3′ sequence in the complementary template strand. The detected series of nucleotide incorporations is sometimes called a “sequence read.”
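The relationship between the detected series of incorporations and the template sequence is a reverse complement. The following minimal Python sketch illustrates it for the four standard DNA bases only; the function name template_from_read is hypothetical, and modified or ambiguous bases are deliberately not handled.

```python
# Watson-Crick pairing for the four standard DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def template_from_read(read_5_to_3: str) -> str:
    """Return the template strand (written 5' to 3') implied by a nascent-strand read.

    The nascent strand is antiparallel to the template, so the read is
    complemented base by base and then reversed.
    """
    return "".join(COMPLEMENT[base] for base in reversed(read_5_to_3.upper()))


# The example from the text: 5'-G-A-3' in the growing chain implies 5'-T-C-3' in the template.
assert template_from_read("GA") == "TC"
print(template_from_read("GATTACA"))
```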

As described above, reaction data is indicative of the progress of a reaction and can serve as a signal for the presence of a modification in the template nucleic acid. Reaction data in single molecule sequencing reactions using fluorescently labeled bases is generally centered around characterization of detected fluorescence pulses, a series of successive pulses (“pulse trace” or one or more portions thereof), and other downstream statistical analyses of the pulse and trace data. Fluorescence pulses are characterized not only by their spectrum, but also by other metrics including their duration, shape, intensity, and by the interval between successive pulses (see, e.g., Eid, et al., supra; and U.S. Patent Publication No. 20090024331, incorporated herein by reference in its entirety for all purposes). While not all of these metrics are generally required for sequence determination, they add valuable information about the processing of a template, e.g., the kinetics of nucleotide incorporation and DNA polymerase processivity and other aspects of the reaction. Further, the context in which a pulse is detected (i.e., the one or more pulses that precede and/or follow the pulse) can contribute to the identification of the pulse. For example, the presence of certain modifications alters not only the processing of the template at the site of the modification, but also the processing of the template upstream and/or downstream of the modification. For example, the presence of modified bases in a template nucleic acid has been shown to change the width of a pulse and/or the interpulse duration (IPD), at the modified base and/or at one or more positions proximal to it. A change in pulse width may or may not be accompanied by a change in IPD. In addition, the types of nucleotides or nucleotide analogs being incorporated into a nascent strand can also affect the sensitivity and response of a polymerase to a modification. For example, certain nucleotide analogs increase the sensitivity and/or response of the enzyme as compared to that in the presence of native nucleotides or different nucleotide analogs, thereby facilitating detection of a modification. In particular, nucleotide analogs comprising different types of linkers and/or fluorescent dyes have been shown to have different effects on polymerase activity, and can impact the incorporation of a base into a nascent strand opposite a modification, and/or can impact the incorporation kinetics for a polynucleotide region proximal to (e.g., upstream or downstream of) the modification. The region proximal to the modification can, in certain embodiments, correspond to the region of the template complementary to the portion of the nascent strand synthesized while the footprint of the polymerase overlapped the locus of the modification. These analog-based differences in polymerase sensitivity and response can be used in redundant sequencing strategies to further enhance the detection of modifications. For example, exchanging nucleotide analogs between iterations of an iterative sequencing reaction elicits changes in polymerase activity between the iterations. Statistical analysis of the differences in the sequencing reads from each iteration combined with the knowledge of how each type of nucleotide analog affects polymerase activity can facilitate identification of modifications present in the reaction. FIG. 1 provides illustrative examples of various types of reaction data in the context of a pulse trace including IPD, pulse width (PW), pulse height (PH), and context. FIG. 1A illustrates these reaction data on a pulse trace generated on an unmodified template, and FIG. 1B illustrates how the presence of a modification (5-MeC) can elicit a change in one of these reaction data (IPD) to generate a signal (increased IPD) indicative of the presence of the modification.
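The pulse metrics described above can be sketched in a few lines of Python. This is an illustrative sketch only: the Pulse record, the function names, and the numerical traces are invented for the example, and IPD is assumed here to be measured from the end of one pulse to the start of the next. Comparing a sample trace with a control trace by a simple per-position ratio is likewise one possible way to expose an increased IPD, not necessarily the analysis used in any particular instrument.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pulse:
    start_s: float  # time the pulse begins, in seconds
    end_s: float    # time the pulse ends, in seconds
    base: str       # base call assigned from the color channel


def pulse_widths(pulses: List[Pulse]) -> List[float]:
    """Pulse width (PW): duration of each pulse."""
    return [p.end_s - p.start_s for p in pulses]


def interpulse_durations(pulses: List[Pulse]) -> List[Optional[float]]:
    """IPD preceding each pulse; None for the first pulse, which has no predecessor."""
    if not pulses:
        return []
    ipds: List[Optional[float]] = [None]
    for prev, cur in zip(pulses, pulses[1:]):
        ipds.append(cur.start_s - prev.end_s)
    return ipds


def ipd_ratios(sample: List[Pulse], control: List[Pulse]) -> List[Optional[float]]:
    """Per-position ratio of sample IPD to control IPD (None where undefined)."""
    ratios: List[Optional[float]] = []
    for s, c in zip(interpulse_durations(sample), interpulse_durations(control)):
        ratios.append(None if s is None or c is None or c == 0 else s / c)
    return ratios


# Invented traces: the same three incorporations, with a longer pause before
# the third pulse in the sample trace.
control = [Pulse(0.00, 0.05, "G"), Pulse(0.20, 0.26, "A"), Pulse(0.40, 0.45, "C")]
sample = [Pulse(0.00, 0.05, "G"), Pulse(0.20, 0.26, "A"), Pulse(0.82, 0.87, "C")]

print(pulse_widths(control))
print(ipd_ratios(sample, control))
```

In this invented example the ratio at the third position is roughly fourfold while the other defined position stays near one, which is the kind of localized IPD increase the text describes as a signal of a modification.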

Similarly, reaction data in single molecule sequencing reactions using nanopore sensors is generally centered around characterization of changes in current through the nanopore itself, e.g., the rate at which current fluctuations indicate a “next” base in the template is in the detection region of the pore, as well as the magnitude of the current fluctuations themselves as well as all aspects of noise that are measured during those fluctuations. For example, metrics important for nanopore-based sequencing include absolute current flow at a given time, a duration of a given current flow measure, a duration of time the current measure indicates the detection region does not comprise a nucleobase, a duration of time between current measures that are indicative of a particular base, a magnitude of change from a baseline current measure, a magnitude of change from a prior or subsequent current measure, and the like. Further, where a protein (e.g., helicase or polymerase) is used to drive the template through the nanopore, interactions of the protein with the template prior to its entry into the pore can affect the kinetics of sequence detection of the portion of the template within the detection region of the pore. In this way, an upstream modified nucleobase can affect detection of an unmodified base within the nanopore.
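Two of the nanopore metrics listed above, the duration of a given current measure and its magnitude of change from baseline, can be extracted from a sampled current trace as sketched below in Python. The fixed-threshold segmentation, the function name blockade_events, the sampling interval, and all current values are assumptions made for illustration; practical analyses typically use more elaborate event detection.

```python
from typing import List, Sequence, Tuple


def blockade_events(
    current_pa: Sequence[float],
    baseline_pa: float,
    min_drop_pa: float,
    dt_s: float,
) -> List[Tuple[int, float, float]]:
    """Return (start_index, duration_s, mean_drop_pa) for each run of samples
    whose current sits at least min_drop_pa below the baseline."""
    if min_drop_pa <= 0:
        raise ValueError("min_drop_pa must be positive")
    samples = list(current_pa)
    events = []
    run_start = None
    # A trailing baseline sample acts as a sentinel that closes any open run.
    for i, value in enumerate(samples + [baseline_pa]):
        blocked = (baseline_pa - value) >= min_drop_pa
        if blocked and run_start is None:
            run_start = i
        elif not blocked and run_start is not None:
            segment = samples[run_start:i]
            mean_drop = baseline_pa - sum(segment) / len(segment)
            events.append((run_start, len(segment) * dt_s, mean_drop))
            run_start = None
    return events


# Invented trace in picoamperes, sampled every millisecond (assumed values).
trace = [100, 100, 60, 62, 58, 100, 100, 40, 40, 100]
for start, duration, drop in blockade_events(trace, baseline_pa=100, min_drop_pa=20, dt_s=1e-3):
    print(f"event at sample {start}: {duration * 1e3:.1f} ms, mean drop {drop:.1f} pA")
```

The gaps between successive events give the corresponding duration between current measures, and a driver-induced pause of the kind described above would appear as a lengthened event or gap.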

In yet further embodiments, reaction data is generated by analysis of the collected signals (e.g., fluorescence emissions, current fluctuations, etc.) to determine error metrics for the reaction. Such error metrics include not only raw error rate, but also more specific error metrics, e.g., identification of signals that did not correspond to a nucleobase within the template nucleic acid. For example, for sequencing reactions that involve template-directed nascent strand synthesis using fluorescently labeled nucleotides, this can include identification of pulses that did not correspond to an incorporation event (e.g., due to “sampling”), incorporations that were not accompanied by a detected pulse, incorrect incorporation events, and the like. Any of these error metrics, or combinations thereof, can serve as a signal indicative of the presence of one or more modifications in the template nucleic acid. In some embodiments, such analysis involves comparison to a reference sequence and/or comparison to replicate sequence information from the same or an identical template, e.g., using a standard or modified multiple sequence alignment. Certain types of modifications cause an increase in one or more error metrics. For example, some modifications can be “paired” with more than one type of incoming nucleotide or analog thereof, so replicate sequence reads for the region comprising the modification will show variable base incorporation opposite such a modification. Such variable incorporation is thereby indicative of the presence of the modification. Certain types of modifications cause an increase in one or more error metrics proximal to the modification, e.g., immediately upstream or downstream. The error metrics at a locus or within a region of a template are generally indicative of the type of modification(s) present at that locus or in that region of the template, and therefore serve as a signal of such modification(s). In preferred embodiments, at least some reaction data is collected in real time during the course of the reaction, e.g., pulse and/or trace characteristics. Similarly, error metrics specific for nanopore-based sequencing can be useful for identifying modifications in templates passed through a nanopore sensor.
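The variable-incorporation signal described above can be illustrated by tallying, position by position, how often replicate reads disagree with a reference. The Python sketch below assumes the reads have already been aligned to the reference and padded to equal length with "-" for gaps; the function name and the example sequences are invented for illustration.

```python
from typing import List


def discordance_by_position(reference: str, aligned_reads: List[str]) -> List[float]:
    """Fraction of replicate reads whose call differs from the reference at each position.

    Reads must be pre-aligned and the same length as the reference; gap
    characters ("-") are ignored rather than counted as mismatches.
    """
    rates = []
    for pos, ref_base in enumerate(reference):
        calls = [read[pos] for read in aligned_reads if read[pos] != "-"]
        if not calls:
            rates.append(0.0)
            continue
        mismatches = sum(1 for call in calls if call != ref_base)
        rates.append(mismatches / len(calls))
    return rates


# Invented replicate reads: calls at the third position vary between reads.
reference = "ACGTAC"
reads = ["ACGTAC", "ACATAC", "ACTTAC", "ACGTAC"]
print(discordance_by_position(reference, reads))
```

A position whose discordance stands well above that of its neighbors, as the third position does in this invented example, is a candidate locus for a modification that pairs with more than one incoming nucleotide; any threshold for calling such a locus would need to be calibrated against the background error rate.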

Although described herein primarily with regard to fluorescently labeled nucleotides, other types of detectable labels and labeling systems can also be used with the methods, compositions, and systems described herein including, e.g., quantum dots, surface enhanced Raman scattering particles, scattering metallic nanoparticles, FRET systems, intrinsic fluorescence, non-fluorescent chromophores, and the like. Such labels are generally known in the art and are further described in Provisional U.S. Patent Application No. 61/186,661, filed Jun. 12, 2009; U.S. Pat. Nos. 6,399,335, 5,866,366, 7,476,503, and 4,981,977; U.S. Patent Pub. No. 2003/0124576; U.S. Ser. No. 61/164,567; WO 01/16375; Mujumdar, et al, Bioconjugate Chem. 4(2):105-111, 1993; Ernst, et al, Cytometry 10:3-10, 1989; Mujumdar, et al, Cytometry 10:1119, 1989; Southwick, et al, Cytometry 11:418-430, 1990; Hung, et al, Anal. Biochem. 243(1):15-27, 1996; Nucleic Acids Res. 20(11):2803-2812, 1992; and Mujumdar, et al, Bioconjugate Chem. 7:356-362, 1996; Intrinsic Fluorescence of Proteins, vol. 6, publisher: Springer US, ©2001; Kronman, M. J. and Holmes, L. G. (2008) Photochem and Photobio 14(2): 113-134; Yanushevich, Y. G., et al. (2003) Russian J. Bioorganic Chem 29(4) 325-329; and Ray, K., et al. (2008) J. Phys. Chem. C 112(46): 17957-17963, all of which are incorporated herein by reference in their entireties for all purposes. Many such labeling groups are commercially available, e.g., from the Amersham Biosciences division of GE Healthcare, and Molecular Probes/Invitrogen Inc. (Carlsbad, Calif.), and are described in ‘The Handbook—A Guide to Fluorescent Probes and Labeling Technologies, Tenth Edition’ (2005) (available from Invitrogen, Inc./Molecular Probes and incorporated herein in its entirety for all purposes). Further, a combination of the labeling strategies described herein and known in the art for labeling reaction components can be used.

International Patent Application No. PCT/US2011/060338 (filed Nov. 11, 2011, and incorporated herein by reference in its entirety for all purposes) provides various additional strategies, methods, compositions, and systems for detecting modifications in a nucleic acid, e.g., during real-time nascent strand synthesis. For example, since DNA polymerases can typically bypass 5-MeC in a template nucleic acid and properly incorporate a guanine in the complementary strand opposite the 5-MeC, additional strategies are desired to detect such altered nucleotides in the template. Various such strategies are provided herein, such as, e.g., a) modification of the polymerase to introduce a specific interaction with the modified nucleotide; b) detecting variations in enzyme kinetics, e.g., pausing, retention time, etc.; c) use of a detectable and optionally modified nucleotide analog that specifically base-pairs with the modification and is potentially incorporated into the nascent strand; d) chemical treatment of the template prior to sequencing that specifically alters 5-MeC sites in the template; e) use of a protein that specifically binds to the modification in the template nucleic acid, e.g., delaying or blocking progression of a polymerase during replication; and f) use of sequence context (e.g., the higher frequency of 5-MeC nucleotides in CpG islands) to focus modification detection efforts on regions of the template that are more likely to contain such a modification (e.g., GC-rich regions for 5-MeC detection). These strategies may be used alone or in combination to detect 5-MeC sites in a template nucleic acid during nascent strand synthesis, and are contemplated for use with the methods and compositions provided herein.
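Strategy (f) above, focusing detection on regions likely to contain 5-MeC, can be illustrated with a simple sliding-window scan for CpG-rich sequence. The following is a minimal sketch only: the function names, the window and step sizes, and the thresholds (GC fraction of at least 0.5 and an observed/expected CpG ratio of at least 0.6, values commonly used in the literature for CpG islands) are assumptions chosen for illustration and are not taken from the present disclosure.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")          # number of CpG dinucleotides
    gc_fraction = (c + g) / n
    expected = (c * g) / n         # CpG count expected from base composition
    obs_exp = cpg / expected if expected else 0.0
    return gc_fraction, obs_exp


def candidate_windows(seq, window=200, step=50, min_gc=0.5, min_obs_exp=0.6):
    """List (start, end) windows that meet the illustrative CpG-richness thresholds."""
    hits = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        gc, obs_exp = cpg_stats(seq[start:start + window])
        if gc >= min_gc and obs_exp >= min_obs_exp:
            hits.append((start, start + window))
    return hits
```

Windows returned by such a scan could then be prioritized for kinetic analysis, while the remainder of the template is analyzed for base sequence alone.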

Various different polymerases may be used in template-directed sequence reactions, e.g., those described at length in U.S. Pat. No. 7,476,503, the disclosure of which is incorporated herein by reference in its entirety for all purposes. In brief, the polymerase enzymes suitable for the present invention can be any nucleic acid polymerases that are capable of catalyzing template-directed polymerization with reasonable synthesis fidelity. The polymerases can be DNA polymerases or RNA polymerases (including, e.g., reverse transcriptases), DNA-dependent or RNA-dependent polymerases, thermostable polymerases or thermally degradable polymerases, and wildtype or modified polymerases. In some embodiments, the polymerases exhibit enhanced efficiency as compared to the wildtype enzymes for incorporating unconventional or modified nucleotides, e.g., nucleotides linked with fluorophores. In certain preferred embodiments, the methods are carried out with polymerases exhibiting a high degree of processivity, i.e., the ability to synthesize long stretches (e.g., over about 10 kilobases) of nucleic acid by maintaining a stable nucleic acid/enzyme complex. In certain preferred embodiments, sequencing is performed with polymerases capable of rolling circle replication. A preferred rolling circle polymerase exhibits strand-displacement activity, and as such, a single circular template can be sequenced repeatedly to produce a sequence read comprising multiple copies of the complement of the template strand by displacing the nascent strand ahead of the translocating polymerase. Since the methods of the invention can increase processivity of the polymerase by removing lesions that block continued polymerization, they are particularly useful for applications in which a long nascent strand is desired, e.g., as in the case of rolling-circle replication. Non-limiting examples of rolling circle polymerases suitable for the present invention include T5 DNA polymerase, T4 DNA polymerase holoenzyme, phage M2 DNA polymerase, phage PRD1 DNA polymerase, Klenow fragment of DNA polymerase, and certain polymerases that are modified or unmodified and chosen or derived from the phages Φ29 (Phi29), PRD1, Cp-1, Cp-5, Cp-7, Φ15, Φ1, Φ21, Φ25, BS 32 L17, PZE, PZA, Nf, M2Y (or M2), PR4, PR5, PR722, B103, SF5, GA-1, and related members of the Podoviridae family. In certain preferred embodiments, the polymerase is a modified Phi29 DNA polymerase, e.g., as described in U.S. Patent Publication No. 20080108082, incorporated herein by reference in its entirety for all purposes. Additional polymerases are provided, e.g., in U.S. Ser. No. 11/645,125, filed Dec. 21, 2006; Ser. No. 11/645,135, filed Dec. 21, 2006; Ser. No. 12/384,112, filed Mar. 30, 2009; and 61/094,843, filed Sep. 5, 2008; as well as in U.S. Patent Publication No. 20070196846, the disclosures of which are incorporated herein by reference in their entireties for all purposes.

In certain aspects, methods and compositions provided herein utilize modified and/or non-natural (e.g., non-standard or non-cognate) nucleotide analogs and/or base pairing. For example, certain non-natural nucleotide analogs can be incorporated by a polymerase into a nascent strand opposite a modification, e.g., missing or damaged base. In certain embodiments, such non-natural nucleotide analogs are detectably labeled such that their incorporation can be distinguished from incorporation of a natural or cognate nucleotide or nucleotide analog, e.g., during template-directed nascent strand synthesis. This strategy allows real-time sequencing that generates reads that not only provide base sequence information for native bases in the template, but also modified bases without requiring further modifications to the standard methods (Eid, et al, supra). This method facilitates modification profiling in the absence of repeated sequencing of each DNA template, and is particularly well suited to de novo applications. In certain embodiments, the modified or non-natural nucleotide analogs are not incorporatable into the nascent strand and the polymerase can bypass the modification using a native nucleotide or nucleotide analog, which may or may not be labeled. Since the modified or non-natural analog has a higher affinity for the modification than a native analog, it will bind to the polymerase complex multiple times (repeatedly being “sampled” by the polymerase) before a native analog is incorporated, resulting in multiple signals for a single incorporation event, and thereby increasing the likelihood of accurate detection of the modification. Similar methods for sequencing unmodified template nucleic acids are described in greater detail in U.S. Ser. No. 61/186,661, filed Jun. 12, 2009; Ser. No. 12/370,472, filed Feb. 12, 2009; Ser. No. 13/032,478, filed Feb. 22, 2011; and Ser. No. 12/767,673, filed Apr. 26, 2010, all of which are incorporated herein by reference in their entireties for all purposes.

Examples of the use of modified, non-natural, non-standard and/or non-cognate nucleotide analogs and/or base pairing to detect modifications of a template nucleic acid are detailed in International Patent Application No. PCT/US2011/060338, filed Nov. 11, 2011, previously incorporated herein. For example, since 5-MeC retains Watson-Crick hydrogen bonding with guanine, a modified guanine nucleotide analog can be used to detect 5-MeC in the template strand, and some guanine nucleotide analogs appropriate for this application are further described elsewhere, e.g., in International Application Pub. No. WO/2006/005064 and U.S. Pat. No. 7,399,614. Similar modifications can be made to nucleotide analogs appropriate for SMRT® sequencing applications, e.g., those with terminal-phosphate labels, e.g., as described in U.S. Pat. Nos. 7,056,661 and 7,405,281; U.S. Patent Pub. Nos. 20070196846 and 20090246791; and U.S. Ser. No. 12/403,090, all of which are incorporated herein by reference in their entireties for all purposes. In certain embodiments, 5-MeC detection may be carried out using a modified guanine nucleotide analog described above that carries a detectable label that is distinguishable from detectable labels on other reaction components, e.g., other nucleotide analogs being incorporated. Such a strategy allows 5-MeC detection by observation of a signal, rather than or in addition to altered polymerase kinetics, which facilitates methylation profiling even in the absence of redundant or replicate sequencing of the template.

Certain embodiments use other non-natural base pairs that are orthogonal to the natural nucleobase pairs. For example, isoguanine (isoG) can be incorporated by a polymerase into DNA at sites complementary to isocytosine (isoC) or 5-methylisocytosine (^(Me)isoC), and vice versa, as shown by the following chemical structure and described in A. T. Krueger, et al., “Redesigning the Architecture of the Base Pair: Toward Biochemical and Biological Function of New Genetic Sets.” Chemistry & Biology 2009, 16(3), 242, incorporated herein by reference in its entirety for all purposes. Other non-natural base pairs that are orthogonal to the natural nucleobase pairs can also be used, e.g., Im-N^(O)/Im-O^(N), dP/dZ, or A*/T* (described further in Yang, et al. (2007) Nucleic Acids Res. 35(13):4238-4249; Yang, et al. (2006) Nucleic Acids Res. 34(21):6095-6101; Geyer, et al. (2003) Structure 11: 1485-1498; J. D. Ahle, et al., Nucleic Acids Res 2005, 33(10), 3176; A. T. Krueger, et al., supra; and A. T. Krueger, et al., Curr Opinions in Chem Biology 2007, 11(6), 588).

III. Chemical Modification of Template

Direct detection of modifications (e.g., methylated bases as described above) without pre-treatment of the DNA sample has many benefits. Alternatively or additionally, complementary techniques may be employed, such as the use of non-natural or modified nucleotide analogs and/or base pairing described elsewhere herein. In general, such complementary techniques serve to enhance the detection of the modification, e.g., by amplifying a signal indicative of the modification. Further, while the methods described herein focus primarily on detection of 5-MeC nucleotides, it will be clear to those of ordinary skill in the art that these methods can also be extended to detection of other types of nucleotide modifications or damage. In addition, since certain sequencing technologies (e.g., SMRT® sequencing) do not require amplification of the template, e.g., by PCR, other chemical modifications of the 5-MeC or other modifications can be employed to facilitate detection of these modified nucleotides in the template, e.g., by employing modifying agents that introduce additional modifications into the template at or proximal to the modified nucleotides. For example, the difference in redox potential between normal cytosine and 5-MeC can be used to selectively oxidize 5-MeC and further distinguish it from the nonmethylated base. Such methods are further described elsewhere, and include halogen modification (S. Bareyt, et al., Angew Chem Int Ed Engl 2008, 47(1), 181) and selective osmium oxidation (A. Okamoto, Nucleosides Nucleotides Nucleic Acids 2007, 26(10-12), 1601; and K. Tanaka, et al., J Am Chem Soc 2007, 129(17), 5612), and these references are incorporated herein by reference in their entireties for all purposes.

By way of example, DNA glycosylases are a family of repair enzymes that excise altered (e.g., methylated), damaged, or mismatched nucleotide residues in DNA while leaving the sugar-phosphate backbone intact. Additional information on glycosylase mechanisms and structures is provided in the art, e.g., in A. K. McCullough, et al., Annual Rev of Biochem 1999, 68, 255. In particular, four DNA glycosylases (ROS1, DME, DML2, and DML3) have been identified in Arabidopsis thaliana that remove methylated cytosine from double-stranded DNA, leaving an abasic site. (See, e.g., S. K. Ooi, et al., Cell 2008, 133, 1145, incorporated herein by reference in its entirety for all purposes.) In addition, methods for the use of glycosylases for detection of various types of DNA damage are described in U.S. Ser. No. 61/186,661, filed Jun. 12, 2009 and incorporated herein by reference in its entirety for all purposes. Furthermore, it has been shown that a 5′-triphosphate derivative of the pyrene nucleoside (dPTP) is efficiently and specifically inserted by certain DNA polymerases into abasic DNA sites through steric complementarity. (See, e.g., T. J. Matray, et al., Nature 1999, 399(6737), 704, incorporated herein by reference in its entirety for all purposes.)

In certain embodiments of single-molecule, five-color DNA methylation sequencing, DNA glycosylase activity can be combined with polymerase incorporation of a non-natural nucleotide analog (e.g., a pyrene analog (dPTP)) covalently linked to a “fifth label” that is detectably distinct from labels on the other nucleotide analogs in the reaction mixture. Further, error metrics can also be used to identify the modification, e.g., an increase in binding events for the pyrene analog may occur at the abasic site, as well as at downstream positions as the incorporated pyrene analog is “buried” in the nascent strand during subsequent incorporation events. In certain embodiments, a non-hydrolyzable pyrene analog carrying a detectable label is used at a concentration sufficient to potentially bind (and be detected) several times at the abasic site before a hydrolyzable (and, preferably, distinctly labeled) analog is incorporated. Alternatively, or in addition, the non-hydrolyzable analog can display an increased residence time and, therefore, lengthen the emitted signal indicative of the presence of the particular lesion of interest. Typically, a non-hydrolyzable fifth-base is eventually displaced by a hydrolyzable analog and synthesis of the nascent strand continues.

In other embodiments of single-molecule, five-color DNA methylation sequencing, DNA glycosylase activity can be combined with addition of a non-natural base (e.g., an otherwise modified cytosine) to replace the methylated base. Briefly, after glycosylase-catalyzed excision of 5-MeC (with or without cleavage of the phosphodiester backbone), a class I or class II AP endonuclease is added to remove the abasic ribose derivative by cleavage at the phosphate groups 3′ and 5′ to the abasic site, thereby leaving 3′-OH and 5′-phosphate termini. A polymerase capable of extending from the free 3′-OH (e.g., Pol I or human pol β) and a non-natural base (e.g., isoC, isoG, or MeisoC) are added to incorporate the non-natural base into the abasic site. A DNA ligase (e.g., LigIII) is added to close the phosphodiester backbone by forming covalent phosphodiester bonds between the free 3′-OH and 5′-phosphates via ATP hydrolysis. Finally, a processive polymerase (e.g., Φ29 DNA polymerase) is used to synthesize a nascent nucleic acid strand complementary to the template strand, where the fifth nucleotide analog is the complement of the non-natural base that replaced 5-MeC in the template. For example, if the replacement base was isoC or MeisoC, then the fifth analog would be isoG. As such, the fifth analog would only incorporate into the nascent strand at positions complementary to 5-MeC sites in the template nucleic acid. In preferred embodiments, the fifth analog has a detectable label (e.g., fluorescent dye) that is distinct from labels on other reaction components, e.g., detectable labels on other nucleotide analogs in the reaction mixture. Further, in certain embodiments, a non-natural or altered nucleotide that can base pair with one of the four nucleotide analogs, e.g., A, G, C, or T, can be used to replace the excised base in the template. In such embodiments no fifth fluorophore is required, and the non-natural or altered nucleotide in the template is detected primarily by virtue of the polymerase behavior during template-directed synthesis, as described at length elsewhere herein. This is particularly beneficial where the presence of the excised base causes a smaller response by the polymerase than the presence of the base with which it is replaced. As such, by removing the initial modified base and replacing it with a different modified base at which a polymerase has a more distinct or extreme kinetic signature, the practitioner enhances detection of the modified locus in the template.

In certain embodiments, the template may be modified by treatment with bisulfite. Bisulfite sequencing is a common method for analyzing CpG methylation patterns in DNA. Bisulfite treatment deaminates unmethylated cytosine in a single-stranded nucleic acid to form uracil (P. W. Laird, Nat Rev Cancer 2003, 3(4), 253; and H. Hayatsu, Mutation Research 2008, 659, 77, incorporated herein by reference in their entireties for all purposes). In contrast, the modified 5-MeC base is resistant to treatment with bisulfite. As such, pretreatment of template DNA with bisulfite will convert unmethylated cytosines to uracils, and subsequent sequencing reads will contain guanine incorporations opposite 5-MeC nucleotides in the template and adenine incorporations opposite the uracil (previously unmethylated cytosine) nucleotides. Treatment of the single-stranded molecule with bisulfite is followed by single-molecule sequencing that provides a sequence read for both strands of the original template nucleic acid. Comparison of the resulting sequence reads for each strand of the double-stranded nucleic acid will identify positions at which an unmethylated cytosine was converted to uracil in the original templates since the reads from the two templates will be non-complementary at that position (A-C mismatch). Likewise, reads from the two templates will be complementary at a cytosine position (G-C match) where the cytosine position was methylated in the original template. In certain preferred embodiments, a circular template is used, preferably having regions of internal complementarity that can hybridize to form a double-stranded region, e.g., as described in U.S. Ser. No. 12/383,855 and U.S. Ser. No. 12/413,258, both filed on Mar. 27, 2009, and both incorporated herein by reference in their entireties for all purposes. Further, since preferred sequencing methods also provide reaction data apart from sequence data, detection of a change in kinetics of the sequencing reaction (e.g., IPD or pulse width) is also used to determine whether or not a position was always a T or is a U that was originally an unmethylated cytosine.
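The two-strand comparison described above can be sketched in a few lines of code. This is an illustrative sketch under stated assumptions, not a description of any particular software: the function name `call_cytosines` is hypothetical, both inputs are assumed to be error-free template sequences of equal length inferred from the reads (with U reported as T), and "methylated" here means only that the cytosine resisted bisulfite conversion.

```python
def call_cytosines(top_template, bottom_template):
    """Classify cytosines by comparing the two bisulfite-treated strands.

    top_template:    treated top-strand bases, written 5'->3'
    bottom_template: treated bottom-strand bases, written 5'->3'
    Returns {top-strand index: (strand carrying the C, status)}.
    """
    if len(top_template) != len(bottom_template):
        raise ValueError("templates must cover the same duplex region")
    # Reverse the bottom strand so index i faces index i of the top strand.
    bottom_aligned = bottom_template.upper()[::-1]
    calls = {}
    for i, (t, b) in enumerate(zip(top_template.upper(), bottom_aligned)):
        if t == "C" and b == "G":
            calls[i] = ("top", "methylated")       # C survived; strands still pair
        elif t == "T" and b == "G":
            calls[i] = ("top", "unmethylated")     # C -> U; strands no longer pair
        elif t == "G" and b == "C":
            calls[i] = ("bottom", "methylated")
        elif t == "G" and b == "T":
            calls[i] = ("bottom", "unmethylated")
    return calls
```

For instance, an original top strand ACGTCA with a methylated C at index 1 and an unmethylated C at index 4 would be passed in as "ACGTTA" after treatment. If the single cytosine on the bottom strand is unmethylated, the bottom strand TGACGT would be passed in as "TGATGT".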

In yet further embodiments, a template nucleic acid is exposed to a reagent that transforms a modified nucleotide to a different nucleotide structure. For example, a bacterial cytosine methyl transferase converts 5-MeC to thymine (M. J. Yebra, et al., Biochemistry 1995, 34(45), 14752, incorporated herein by reference in its entirety for all purposes). Alternatively, the reagent may convert a methyl-cytosine to 5-hydroxy-methylcytosine, e.g., the hydroxylase enzyme TET1 (M. Tahiliani, et al., Science 2009, 324(5929), 930, incorporated herein by reference in its entirety for all purposes). In further embodiments, the reagent may include a cytidine deaminase that converts methyl-cytosine to thymine (H. D. Morgan, et al., J Biological Chem 2004, 279, 52353, incorporated herein by reference in its entirety for all purposes). In yet further embodiments, a restriction enzyme that specifically alters a modification of interest can be used to create a lesion at the modification site. For example, DpnI cleaves at a recognition site comprising methyladenosine. Optionally, the cleaved template could be repaired during an analytical reaction by inclusion of a ligase enzyme in the reaction mixture. As noted elsewhere herein, nucleotides other than 5-MeC can also be modified and detected by the methods provided herein. For example, adenine can be converted to inosine through deamination, and this conversion is affected by methylation of adenine, allowing differential treatment and detection of adenine and MeA. The kinetic signatures of the resulting altered nucleotide during sequencing can be used to determine the nucleotide sequence of the original template nucleic acid.

Another modified base that can be detected using the methods provided herein is 5-hydroxymethylcytosine (5-hmC). Conventional bisulfite sequencing does not effectively distinguish 5-hmC from 5-MeC because 5-hmC tends to remain unmodified like 5-MeC. However, in certain embodiments, bisulfite conversion in combination with real-time single-molecule sequencing can be used in methods for distinguishing 5-MeC from 5-hydroxymethylcytosine (5-hmC). As noted above, bisulfite conversion changes cytosine into uracil and does not change 5-MeC. Bisulfite conversion also changes hydroxymethyl-cytosine (5-hmC) to cytosine-5-methylenesulfonate (CMS), which contains a bulky SO₃ adduct in place of the OH adduct of 5-hmC. Like methyl-cytosine, CMS base-pairs with guanine. A template so treated is subsequently subjected to a single-molecule template-directed sequencing reaction. The uracils present in the template (due to bisulfite conversion) can be distinguished from thymines using polymerase behavior, e.g., interpulse duration, pulse width, frequency of cognate sampling, accuracy of pairing, etc. If a complementary strand is also subjected to sequencing, then the complementary nucleotide sequence information can also be used to identify bases, as described above. Further, the SO₃ adduct added during the conversion of 5-hmC to CMS will enhance the response of the polymerase to the modified base (e.g., causing increased pausing) and thereby facilitate identification of CMS versus 5-MeC in the template. As such, in certain embodiments a nucleic acid sample is treated with bisulfite and sequenced. U is discriminated from T based on polymerase kinetics and standard bisulfite sequencing algorithms, with those bases detected as U known to have originally been C. Bases detected as C based on their base-pairing with G are known to be 5-MeC or CMS (originally 5-hmC). 5-MeC and CMS are discriminated based upon their relatively different kinetics, due at least in part to the SO₃ adduct present in CMS and absent in 5-MeC.
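The decision logic of the preceding paragraph can be summarized as a small classifier. The sketch below is hypothetical throughout: the function name, the use of a single interpulse duration (IPD) ratio as the kinetic measure, and both threshold values are assumptions made for illustration, and in practice the direction and size of each kinetic shift would have to be established empirically with control templates.

```python
def classify_original_base(called_base, ipd_ratio,
                           u_threshold=1.5, cms_threshold=3.0):
    """Infer the pre-bisulfite base at one template position.

    called_base: template base inferred from base pairing; "T" covers both
                 thymine and uracil, "C" covers both 5-MeC and CMS
    ipd_ratio:   observed IPD divided by the IPD of an unmodified control
                 (hypothetical metric and thresholds, for illustration only)
    """
    if called_base == "T":
        # Assumed: U shows a kinetic shift relative to T; U was originally C.
        return "C" if ipd_ratio >= u_threshold else "T"
    if called_base == "C":
        # The bulky sulfonate adduct of CMS is described as increasing
        # pausing, so the larger shift is attributed to CMS (originally 5-hmC).
        return "5-hmC" if ipd_ratio >= cms_threshold else "5-MeC"
    return called_base  # A and G are reported unchanged
```

A production classifier would more plausibly combine several kinetic features (pulse width, sampling frequency) and sequence context rather than a single fixed cutoff.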

In yet further embodiments, the template can be further subjected to electrophile modification of 5-hmC, e.g., by addition of a bulky group that facilitates its detection and discrimination from 5-MeC and unmodified cytosine. Preferred methods include those described in the art, e.g., in Merino, et al. (2005) J. Am. Chem. Soc. 127: 4223-4231, Petrov, A. I. (1980) Nuc. Ac. Res. 8(23):5913-5929; Petrov, et al. (1980) Nuc. Ac. Res. 8(18):4221-4234; and Kamzolova, S. G. (1987) Biokhimiia 52(9):1577-82, the disclosures of which are incorporated herein by reference in their entireties for all purposes. Addition of a bulky group at the OH group of 5-hmC alters the kinetics of the DNA polymerase-mediated incorporation of a nucleoside into a nascent strand opposite the modified 5-hmC, and this alteration facilitates detection and mapping of the 5-hmC within a template nucleic acid. These and other electrophilic compounds known in the art can be used similarly to those described above to add bulky adducts to nucleic acids and, thereby, provide a characteristic kinetic signature during single molecule sequencing reactions that is indicative of the presence of a given base so modified.

Glucosyltransferases can also be used to modify a template nucleic acid prior to sequencing, e.g., to enhance kinetic or other reaction data to facilitate detection of modified bases. For example, DNA glucosyltransferases can be used to transfer a glucose group to 5-hmC to enhance detection of this modified base. Exemplary enzymes for transferring glucose groups to hmC include, but are not limited to, T2-hmC-α-glucosyltransferase, T4-hmC-α-glucosyltransferase, T6-hmC-α-glucosyltransferase, and T2-hmC-β-glucosyltransferase. Other enzymes can be used to create diglucosylated hmC, such as T6-glucosyl-hmC-β-glucosyltransferase, which creates diglucosylated hmC with a β linkage between the two glucose groups. These enzymes are generally specific for hmC and do not typically alter other bases such as A, C, MeC, T, or G. As such, treating hmC-containing nucleic acids with such enzymes creates nucleic acids in which the hmC residues have been converted to monoglucosylated-hmC or multi-glucosylated-hmC. Glucosylated-hmC is much larger and bulkier than hmC, and therefore has a distinctive effect on polymerase activity when present in a template nucleic acid. Details on the glucosylation of 5-hmC by glucosyltransferases are known in the art, e.g., in Josse, et al. (1962) J. Biol. Chem. 237:1968-1976; and Lariviere, et al. (2004) J. Biol. Chem. 279:34715-34720.

The strategy for addition of glucose moieties to hmC described above can be modified in various ways. For example, (a) the glucose adducts added could comprise a detectable label to provide another mode of detection, e.g., in addition to monitoring the kinetics of the reaction, (b) an affinity tag could be linked to a glucose adduct for subsequent purification, and/or (c) further steps can be performed to add modifications in addition to the glucose adduct, e.g., that are linked to the nucleic acid through the glucose adduct. In yet further embodiments, a glucosyltransferase enzyme can be used that binds to the template but does not dissociate, and therefore results in a further modification (e.g., bound agent) that can be detected during single-molecule sequencing, e.g., by detection of a significant pause of nascent strand synthesis.

In further embodiments, both hmC and 5-MeC can be modified prior to sequencing. For example, the nucleic acid can be subjected to glucosylation to convert hmC to glucose-hmC, and subsequently the 5-MeC bases can be converted to hmC, e.g., using TET1 protein. Detection of glucose-hmC will be indicative of an hmC in the original nucleic acid, and detection of hmC will be indicative of a 5-MeC in the original nucleic acid. Alternatively, the hmC generated by conversion of 5-MeC can be further modified to produce a greater enhancement of detection while maintaining a signal distinct from that of the glucose-hmC generated by conversion of hmC in the original nucleic acid. Alternatively or additionally, different sugar groups can be added to each, e.g., selected from glucose, maltose, sucrose, lactose, galactose, or multiples (e.g., di- or tri-glucosyl (or other sugar) groups) or combinations thereof.
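Because the two-step treatment shifts each modified cytosine to a different detectable species, interpreting the sequencing output amounts to a lookup from the detected species back to the original base. A minimal sketch follows, in which the dictionary, the function name, and the string labels are hypothetical and both conversions are assumed to go to completion.

```python
# Detected species after glucosylation followed by TET1 treatment,
# mapped to the base assumed to have been present in the original sample.
DETECTED_TO_ORIGINAL = {
    "glucose-hmC": "5-hmC",  # hmC was glucosylated in the first step
    "hmC": "5-MeC",          # 5-MeC was converted to hmC in the second step
    "C": "C",                # unmodified cytosine is left unchanged
}


def infer_original(detected_calls):
    """Translate (position, detected species) pairs to original bases."""
    return [(pos, DETECTED_TO_ORIGINAL.get(species, species))
            for pos, species in detected_calls]
```

Incomplete conversion in either step would cause some original 5-MeC or 5-hmC sites to be misreported, so in practice conversion efficiency would need to be measured with controls.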

Although the disclosures above focused on introducing additional modifications to nucleotides that already comprised a modification, modifications can also be introduced to nucleotides that are initially unmodified, e.g., where it is desirable to enhance the detection of a given base by adding a kinetic signature. A modification can be added that results in the kinetic signature during a sequencing reaction. For example, all unmodified cytosines in a template nucleic acid can be converted to 4-MeC prior to sequencing. Any 5-MeC will not be converted. During sequencing the kinetic signatures generated at each cytosine will indicate whether the cytosine is 4-MeC (originally unmodified C) or 5-MeC (5-MeC in the original nucleic acid). Alternatively, one can employ a methyltransferase under deamination conditions as described in Sharath et al. ((2000) Biochemistry 39(47):14611-14616) to deaminate cytosine bases while leaving 5-MeC unaltered. The deaminated cytosine bases, now uracils, would be “read” as thymine, and would be distinguishable from native thymines due to the sequence of the complementary strand, which would have a G rather than an A base in the complementary position. It will be appreciated that these examples are nonlimiting, and that other base modifications can be introduced to unmodified bases in a template to alter or enhance a kinetic signature within a sequencing reaction.

Further, while certain aspects of the methods provided herein benefit from the absence of amplification, which in many instances results in loss of base modifications in the resulting replicons, there are methods of amplification that preserve such modifications, thereby allowing faithful replication of the modified bases present in the original nucleic acid to provide replicons having the same modified bases. For example, a nucleic acid of interest can be cloned into a vector and transfected into a host (e.g., bacterial or other cell line) that will replicate the vector, and therefore the nucleic acid therein, and maintain the modifications, e.g., through the use of modification enzymes either naturally expressed by the host, or that the host has been engineered to express. After replication, the vector can be extracted from the host and the nucleic acid of interest excised and subjected to sequencing. Alternatively, the replicons can be treated in vitro to maintain modifications. Hemimethylated nucleic acids are produced from semiconservative replication of a fully methylated duplex to produce two daughter duplexes, each having one methylated strand and one unmethylated strand. Certain methyltransferase enzymes bind to hemimethylated nucleic acids and methylate the strand lacking methylation, thereby restoring the methylation status of the original nucleic acid molecule. Essentially any method for amplifying or replicating a modified nucleic acid that maintains the modification in the daughter nucleic acids can be used with the methods herein.

IV. Analysis of Genomic Samples

Epigenetic modifications on a chromosome are known to influence expression of genes that also map to that chromosome. For example, methylation or other modifications in promoter regions or repeat regions have been shown to affect gene expression and/or subsequent post-transcriptional modification, e.g., splicing of the mRNA transcripts. Alternative splicing patterns can result in the production of aberrant polypeptide products, and can thereby be the disease mechanism in certain disorders, such as repeat expansion disorders. As such, it is of interest to be able to map the chromosomal locations of epigenetic modifications, including both CpG and non-CpG methylation, hydroxymethylation, and others described herein, with detection of these modifications serving as a prognostic or diagnostic for certain disorders, e.g., to inform as to the susceptibility or resistance of an individual to such disorder, the expected severity of the disorder, the expected age-of-onset of the disorder, and/or preferred theranostic strategies that could prevent or lessen the severity of the disorder.

The pattern of modifications across a plurality of homologous chromosomes can be determined and associated with phenotypic traits, much as a sequence variant is analyzed. For simple organisms having only a single chromosome, this can be done relatively simply, e.g., using a long-read sequencing technology that can provide both sequence and modification data for the chromosome, e.g., using the methods described in International Patent Application No. PCT/US2011/060338, filed Nov. 11, 2011. With a genome having only a single chromosome, all sequence reads are reasonably mapped on the same chromosome. The mapping problem can become more difficult for organisms that have multiple different types of chromosomes, but typically polynucleotide sequence differences between the different types of chromosomes are indicative of which chromosome produced a given sequence read, as long as the sequence read is long enough to have a unique locus in the genome. Further, there are methods for selecting and isolating a single chromosome prior to sequencing, which disambiguates the chromosomal source of the resulting reads, but such isolation is not usually perfect, so there can still be reads generated from contaminating nucleic acids. Further, the sample losses that accompany such a procedure may not be tolerable, e.g., where a genomic sample is in short supply. Further complicating genomic sequence mapping are polyploid organisms, which have a plurality of the same chromosome type, termed “homologous chromosomes,” in a single cell. Since the polynucleotide sequences of two homologous chromosomes are very similar (typically nearly identical), it can be difficult to map sequence reads from fragments of the homologous chromosomes to one homolog or the other using polynucleotide sequence data alone, since a given read may actually be identical to regions of both homologs.

For example, humans have chromosomes numbered from 1 to 22 (autosomes), plus X and Y “sex chromosomes,” and as diploid organisms they have two of each type of autosome (1-22) and either an XX or XY pair for a total of 46 individual chromosomes per somatic cell. During gametogenesis, the diploid genome is reduced so that each haploid gamete produced (egg or sperm) contains only one of each autosome and one sex chromosome. During fertilization, the paternal set of nuclear chromosomes in a sperm cell are released into an egg cell where they pair with the maternal set of nuclear chromosomes in the egg, thereby restoring the diploid number of nuclear chromosomes. This fertilized egg can then begin cell division leading to the development of an offspring that comprises pairs of “homologous” chromosomes in each cell, each pair comprising a maternally-derived chromosome and a paternally-derived chromosome of the same chromosome type. Chromosomes that are homologous to one another typically comprise the same genes, regulatory regions, and other major chromosomal characteristics in the same locations along their lengths, but these genes and other sequences are not necessarily identical between the two chromosomes and there are often small differences between their polynucleotide sequences (different alleles, SNPs, insertions, deletions, etc.), and the presence and the locations of modifications including but not limited to modified bases (e.g., methylation, hydroxymethylation, etc.), bound agents (transcription factors, histones, etc.), and secondary structures (hairpin structures, etc.). Epigenetic modifications are known to affect chromosome function, e.g., gene expression and regulation, and since these effects typically are specific to loci on the same chromosome as the epigenetic modifications, it is of interest to determine which polynucleotide sequences (e.g., which allele of a gene of interest) share a chromosome with these and other modifications. In doing so, a haplotype for each chromosome is determined that comprises both sequence and modification information, providing a rich data set that can distinguish between sequence reads from the maternally-derived and paternally-derived homologs. Detection of modified bases, bound agents, secondary structures, and other nucleic acid modifications is further detailed in International Patent Application No. PCT/US2011/060338, filed Nov. 11, 2011, incorporated herein by reference in its entirety for all purposes.

Since human chromosomes are very large and difficult to handle in one piece, they are typically fragmented before sequencing, which can separate an epigenetic modification from a genomic region whose function is regulated by the modification. As such, sequencing reads must be assembled together in order to reconstruct the sequence of the original chromosome, i.e., to ensure that the alleles in cis with one another are all mapped to the same homolog, and those in trans with one another are mapped to different homologs. Sequencing methods that produce long sequencing reads make this assembly easier, at least in part because the fragments being sequenced are larger, but it is still a significant challenge to distinguish between polynucleotide sequences from two homologous chromosomes having very similar sequences, even where allelic differences exist. In certain aspects, the invention provides a method of sequencing homologous chromosomes in the same sequencing reaction mixture, where individual template nucleic acid molecules are subjected to single-molecule, real-time sequence analysis that provides both base sequence data and kinetic data, e.g., based on the real-time kinetics of the sequencing reaction itself. The modification data indicates to which homologous chromosome a given polynucleotide sequence read (and the alleles therein) maps, allowing construction of the homologous chromosome sequence and identification of which alleles are in cis with the modifications present. In some embodiments, a modification is known to be on only one of a pair of homologous chromosomes, and the detection of the kinetic signature of the modification within a sequence read is used to assign the sequence read to the correct homolog. In other embodiments, the detection of kinetic signatures of a modification is used to facilitate assembly of sequence reads into a contig by providing elements other than base identity to use in constructing a multiple sequence alignment. The use of kinetic data is applicable to both resequencing and de novo sequencing applications, and generally provides a richer data set for a wide variety of subsequent analyses.

In a preferred embodiment, a sample comprising genomic DNA is treated to fragment the genomic DNA, and the resulting genomic fragments, which include pairs of homologous chromosomes, are included in a single reaction mixture. The genomic fragments are subjected to a single-molecule, real-time sequencing reaction that produces both polynucleotide sequence data and modification data within a single sequencing read for a single fragment molecule. A sequencing read is generated for each fragment molecule sequenced, and the polynucleotide sequence data and modification data are analyzed to assign the sequence reads to a particular homologous chromosome. As such, the assembly of the sequence reads depends not only on the polynucleotide sequence data in each read, but also on the types and patterns of modifications detected within the same read. These modifications essentially act as additional base types that increase the complexity of the sequence reads generated as compared to sequence reads employing only four base types. The increased complexity simplifies mapping of those reads to the correct homolog since the more complex sequencing reads are unlikely to map to both homologs. In preferred embodiments, the genomic DNA fragments are not chemically modified prior to sequencing, e.g., by treatment with bisulfite, and the modified and unmodified nucleotides in the genomic fragments are detected in their native form. In this way, modification detection facilitates determination of the homologous chromosome source of a genomic DNA fragment and subsequent assembly of sequencing reads to construct the homologous chromosome sequences.
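
By way of non-limiting illustration, the following Python sketch shows how a read might be assigned to a homolog when modification state is treated as part of each observed position. The profile layout, the function names `score_read` and `assign_homolog`, the positions, and the simple match-count score are all hypothetical choices made for the example and are not a description of any particular implementation.

```python
from typing import Dict, Tuple

# Each profile maps a genomic position to (base, is_modified) for one homolog.
# Positions and values below are hypothetical illustration data.
Profile = Dict[int, Tuple[str, bool]]


def score_read(read: Profile, homolog: Profile) -> int:
    """Count positions where both the base and the modification state agree."""
    return sum(1 for pos, obs in read.items() if homolog.get(pos) == obs)


def assign_homolog(read: Profile, homologs: Dict[str, Profile]) -> str:
    """Return the best-scoring homolog name, or 'ambiguous' on a tie."""
    scores = {name: score_read(read, prof) for name, prof in homologs.items()}
    best = max(scores.values())
    winners = [name for name, s in scores.items() if s == best]
    return winners[0] if len(winners) == 1 else "ambiguous"


# The two homologs share the same bases; only the modification at 100 differs.
maternal = {100: ("C", True), 101: ("G", False), 150: ("A", False)}
paternal = {100: ("C", False), 101: ("G", False), 150: ("A", False)}
homologs = {"maternal": maternal, "paternal": paternal}

read_with_mod = {100: ("C", True), 101: ("G", False)}    # covers the modified site
read_bases_only = {101: ("G", False), 150: ("A", False)}  # no distinguishing site

print(assign_homolog(read_with_mod, homologs))
print(assign_homolog(read_bases_only, homologs))
```

In this toy case the first read is resolved only because its modification state is scored alongside its bases, while the second read, which covers no distinguishing position, remains ambiguous.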

In an aspect of the invention, there is an algorithm loaded into a digital computer that contains DNA sequence data and the associated chemical modification information pertaining to the DNA. When there are exploitable differences in modification with position, these can be identified by the use of a clustering algorithm. The algorithm would be applied through a sequence of the following steps. First, a provisional assembly of the data would be conducted that would identify contigs with high confidence that they are correct. In one implementation, the assembly algorithm is configured to include as much as possible of any repeat that may be present at the ends. In a clonal organism, these contigs will nearly all terminate in such repeats whose length is longer than the read length of the sequencing system. A table of repeat sequences is constructed by performing an overlap analysis of the sequences at the ends of the contigs. Selected for the table is any sequence that occurs in more than one contig end and is identical for at least n bases, where n could be half the average read length, the average read length, twice the average read length or three times the average read length. All of the reads that have any portion mapping to any entry of the table are then mapped using an alignment algorithm, e.g., BLASR (http://www[dot]pacbiodevnet[dot]com/SMRT-Analysis/Algorithms/BLASR) or the Burrows-Wheeler Aligner (BWA; bio-bwa[dot]sourceforge[dot]net/), allowing every read to map to any location that passes a significance threshold without regard for better or worse mapping to other locations. For each of the table entries, the reads are then clustered according to the kinetic information, using each of the interpulse duration (IPD) values (and if desired also the pulse duration values) as a dimension in a high-dimensional vector. For the sake of definiteness, in some embodiments a window of length W is chosen for each valid starting position within the table entries; reads that span the entire window are selected, and each IPD and/or pulse width value within the window is used as a dimension in the W-dimensional (or 2*W in the case of IPD and pulse width) space. These vectors are then subjected to a cluster analysis by any of a number of methods known in the art (examples include K-means clustering, expectation maximization, DBSCAN, OPTICS, and others) to determine if there are resolvable clusters within the population of reads. For a contig fragment of length N this is repeated N-W times. The clusters in the ith window are associated with the clusters in the i+1th window by looking for common membership of reads. For example, if there are two clusters A and B in the ith window and there are two clusters C and D in the i+1th window, then if 95% of the members of A are found in C and 95% of the members of B are found in D, then A and C are identified as coming from the same genomic location, and similarly B and D. This is conducted across all of the valid windows within each contig. Next, a subset of the windows is chosen (that, for example, could collectively cover every base of every table entry). The windows in this subset are similarly associated with one another by joint membership above a threshold value (which can be static or dynamically assigned). The resulting table of associations represents a graph which will then resolve assembly ambiguities in an otherwise unresolvable graph. This is done by identifying windows that contain at least a certain amount of unique sequence and associating those windows with the contigs from the assembly; a string of window associations can then be followed from one to the next to lead to another contig. In cases where sufficient kinetic differences exist between different versions of sequence repeats, these strings of association will have no branches and thus there will be no remaining ambiguities.
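
The windowed clustering and cluster-linking steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: every read's IPD list is taken to be already indexed by position within a repeat-table entry, a tiny deterministic two-cluster K-means stands in for whichever clustering method is actually used, and the names `window_vectors`, `two_means`, and `link_clusters`, together with the IPD values, are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Vector = List[float]


def window_vectors(ipds: Dict[str, Vector], start: int, w: int) -> Dict[str, Vector]:
    """Return a W-dimensional IPD vector for each read spanning the window.

    Assumes each read's IPD list is indexed by table-entry position from 0.
    """
    return {rid: v[start:start + w] for rid, v in ipds.items() if len(v) >= start + w}


def two_means(vectors: Dict[str, Vector], iters: int = 10) -> List[Set[str]]:
    """Deterministic 2-means (needs at least one read); a stand-in clusterer."""
    ids = sorted(vectors)
    by_sum = sorted(ids, key=lambda r: sum(vectors[r]))
    # Seed the centroids with the reads of lowest and highest total IPD.
    centroids = [list(vectors[by_sum[0]]), list(vectors[by_sum[-1]])]
    groups: List[Set[str]] = [set(), set()]
    for _ in range(iters):
        groups = [set(), set()]
        for rid in ids:
            dists = [sum((a - b) ** 2 for a, b in zip(vectors[rid], c)) for c in centroids]
            groups[0 if dists[0] <= dists[1] else 1].add(rid)
        for k in (0, 1):
            if groups[k]:
                size = len(groups[k])
                centroids[k] = [sum(vectors[r][j] for r in groups[k]) / size
                                for j in range(len(centroids[k]))]
    return groups


def link_clusters(prev: List[Set[str]], nxt: List[Set[str]],
                  min_frac: float = 0.95) -> List[Tuple[int, int]]:
    """Link cluster i of one window to cluster j of the next window when at
    least min_frac of the reads in cluster i are also members of cluster j."""
    links = []
    for i, a in enumerate(prev):
        for j, b in enumerate(nxt):
            if a and len(a & b) / len(a) >= min_frac:
                links.append((i, j))
    return links


# Hypothetical IPD values: r1 and r2 show slow incorporation at two positions.
ipds = {
    "r1": [0.2, 1.5, 0.2, 1.4],
    "r2": [0.3, 1.6, 0.2, 1.5],
    "r3": [0.2, 0.3, 0.2, 0.3],
    "r4": [0.3, 0.2, 0.3, 0.2],
}
W = 3
clusters = [two_means(window_vectors(ipds, start, W)) for start in range(2)]
links = link_clusters(clusters[0], clusters[1])
print(links)
```

Following the resulting links from window to window yields the strings of association discussed above; a cluster that links to more than one cluster in the next window would correspond to a branch, i.e., a remaining ambiguity.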

In some embodiments, the types or patterns of modifications present on one chromosome are known to be absent from its homolog. For example, in mammalian females the process of dosage compensation causes inactivation of one X chromosome (to produce a “Barr body”) while the other X chromosome remains active. X-inactivation is a process by which one X chromosome is (mostly) inactivated by packaging it into transcriptionally inactive heterochromatin, which includes hypermethylation of the DNA as well as other structural modifications. Additional information on X-inactivation is well known in the art, e.g., see the New South Wales government's Health Centre for Genetics Education website at www[dot]genetics[dot]edu[dot]au/Information/Genetics-Fact-Sheets. Although X-inactivation has been studied for over half a century, many questions remain. The mechanisms by which a cell counts and silences X chromosomes and maintains the silenced state are not fully understood, nor is the role of noncoding RNAs or the fact that not all genes are silenced within an “inactive” X chromosome. Therefore, the methods provided herein can be used to analyze sequence reads displaying hypermethylation characteristic of X-inactivation and assign them to the X chromosome homolog packaged into the Barr body, while those sequence reads not exhibiting such hypermethylation are mapped to the active X chromosome homolog. Since inactivation in marsupials applies exclusively to the paternally derived X chromosome, sequence reads that show X-inactivation-related hypermethylation are assigned to the paternal homologous X chromosome, while those that do not are assigned to the maternal homologous X chromosome. Since X-inactivation is random in other mammals, such as humans, detection of hypermethylation alone cannot assign a given sequence read to either the maternal or paternal chromosome. However, once an X chromosome is inactivated it will remain inactive throughout the lifetime of the cell and this inactivation is maintained in its daughter cells, so within a population of daughter cells produced from a progenitor cell after X-inactivation, detection of X-inactivation-specific hypermethylation can distinguish between the two homologous chromosomes. In addition, the study of X-inactivation may also provide insight into cancer biology, since many human breast and ovarian tumors have two active X chromosomes (Liao, et al. (2003) Cancer Investigation 21, 641-658). Yet further, at least one study has found extensive variability in the expression of X-linked genes in females (Carrel, et al. (2005) Nature 434: 400-404), and the methods herein provide a simplified strategy for the study of X-inactivation in humans and other organisms that utilize this form of dosage compensation. X-inactivation represents a great model system with which to study a broad range of developmental and epigenetic processes, in particular those involving stable gene expression regulation without changes to the underlying DNA sequence. The methods herein provide strategies for studying these epigenetic mechanisms over long stretches of genomic DNA by allowing simultaneous sequencing and modification detection that can not only provide haplotype information for a single homolog, but can also distinguish between maternally-derived and paternally-derived sequences.

Genomic imprinting is a genetic phenomenon that causes parent-of-origin-specific gene expression. An imprinted allele of a gene is silenced while the non-imprinted allele of the gene is expressed. In some cases, the non-imprinted allele is inherited from the mother and the imprinted allele is inherited from the father (e.g., H19 or CDKN1C), and in other cases the non-imprinted allele is inherited from the father and the imprinted allele is inherited from the mother (e.g., IGF-2). This monoallelic gene expression is accomplished through the use of epigenetic modifications including methylation and histone modifications. Often these modifications are established in the germline and maintained throughout all somatic cells of an organism, but some imprinted genes display monoallelic gene expression in a tissue-specific manner. In insects, imprinting has been shown to silence the paternal alleles in males, which is involved in sex determination. Many imprinted genes are found in clusters, termed “imprinted domains,” suggesting a level of coordinated control. Common regulatory elements in or near imprinted domains include non-coding RNAs and differentially methylated regions. When these regulatory elements control imprinting, they are termed “imprinting control regions.” Although hypermethylation is often associated with gene silencing, the effect of methylation depends upon the default state of the region.

Transcriptional profiles have typically been used to identify imprinted genes, and these require analysis of mRNA transcripts. There are at least 80 imprinted genes in humans and mice, and there may be many more. The methods herein provide a direct method of identifying and analyzing imprinted genes that does not require purification, sequencing, and/or other analysis of mRNA transcripts. Rather, DNA is analyzed directly to identify imprinted genes by virtue of simultaneous detection of both base sequence data and modification data in sequence reads for single DNA fragments. For imprinting that is maintained in all somatic tissues of an organism, this approach can use somatic tissue from anywhere in the body to detect epigenetic markers consistent with imprinting, and does not depend on the level of expression of the non-imprinted allele as does a transcriptional profiling approach. As such, the ability to generate sequence information having both base sequence information and modification information provides the opportunity to analyze genes to not only detect epigenetic signatures of imprinting, but also to determine whether the maternal chromosome, paternal chromosome, or both chromosomes are imprinted. Further, where chromatin is sequenced without removing the histones, the kinetics of sequencing is affected by the presence and modification status (e.g., methylation, acetylation, and/or phosphorylation) of histones, so even that aspect of imprinting can be interrogated using the methods described herein and in International Patent Application No. PCT/US2011/060338, filed Nov. 11, 2011. For more information on histone modifications, see, e.g., Fuks, F. (2005) Curr. Op. Genet. & Dev. 15:1-6; and Zhang, et al. (2001) Genes & Dev. 15:2343-2360, both of which are incorporated herein by reference in their entireties for all purposes. The analysis and diagnosis of diseases and disorders due to imprinting and/or other parent-of-origin-dependent expression patterns is contemplated, such patterns having been linked to a multitude of phenotypes, e.g., Beckwith-Wiedemann syndrome, Alzheimer disease, mitochondrial disorders/syndromes, metabolic disorders, autism, bipolar disorder, diabetes, male sexual orientation, aging, obesity, and schizophrenia; as well as a number of cancers: bladder, breast, cervical, colorectal, esophageal, hepatocellular, lung, mesothelioma, ovarian, prostate, testicular, and leukemia, among others (Falls et al, Genomic Imprinting: Implications for human disease. Am J Pathol 154: 635-47, 1999; Jirtle, Genomic imprinting and cancer. Exp Cell Res 248: 18-24, 1999; Simmons, et al. (2008) Nature Education 1(1); Takasugi, et al. (2010) BMC Genomics 11:481; and Barnes, et al. (2011) Am J Clin Nutr 93(4):8975-9005, the disclosures of which are incorporated herein by reference in their entireties for all purposes). Additional information on imprinting is provided in Sleutels, et al. (2002) Advances in Genetics 46, 11-163.

Pseudogenes are gene-like sequences in a genome that are not expressed, e.g., are not transcribed or their transcripts are not translated. They are characterized by their similarity to known genes, and are often labeled as “junk DNA.” They also frequently display methylation patterns that differ from their active gene counterpart, as shown, e.g., in Cortese, et al. (Genomics 91(6):492-502 (2008)), incorporated herein by reference in its entirety for all purposes. As such, the generation of both modification data and polynucleotide sequence data in a single sequencing read provides a means to distinguish a pseudogene sequence from an active gene sequence even where the polynucleotide sequence data is similar. For example, in many cases pseudogenes are more heavily methylated than an active gene sequence, so two sequencing reads having the same polynucleotide sequence can be mapped to the pseudogene or active gene depending on the level of methylation present in the read.
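
As a simple, hedged illustration of mapping by methylation level, the Python sketch below assigns a read to the pseudogene or the active gene according to the fraction of CpG sites called methylated within the read. The 0.5 threshold, the per-site boolean input, and the function names are assumptions made for the example; a practical threshold would have to be derived from reference methylation data for the loci in question.

```python
from typing import List


def methylation_fraction(cpg_calls: List[bool]) -> float:
    """Fraction of CpG sites in the read that were called methylated."""
    return sum(cpg_calls) / len(cpg_calls)


def assign_copy(cpg_calls: List[bool], threshold: float = 0.5) -> str:
    """Assign a read whose base sequence matches both copies.

    threshold is a hypothetical cut-off, not a value taken from the text.
    """
    if not cpg_calls:
        return "unassigned"  # no CpG sites covered, so methylation is uninformative
    if methylation_fraction(cpg_calls) >= threshold:
        return "pseudogene"
    return "active gene"


heavily_methylated = [True, True, False, True]
lightly_methylated = [False, False, True, False]

print(assign_copy(heavily_methylated))
print(assign_copy(lightly_methylated))
print(assign_copy([]))
```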

Similarly, certain organisms, and in particular plants, can have duplicated or “multi-copy” genes that have been shown to have variable methylation patterns. As such, mapping of sequencing reads from these multi-copy genes to a particular copy can be facilitated by consideration of modifications present in the sequencing reads where those modifications distinguish between different copies of the gene in the genome. Also, patterns of modifications across multi-copy genes, or even an overall amount of one or more modifications across the multiple copies, can be considered a genomic trait that may be associated with a phenotype of interest. Likewise, the mapping of modifications within these multi-copy genes can allow counting of the number of copies to further characterize the structure of a genomic region.

Yet further, the ability to provide the long haplotypes comprising sequence and modification data afforded by a single-molecule sequencing technology that collects kinetic information in real time provides tremendous benefits to understanding the epigenetic interactions within the genome. For example, as noted above, the ability to determine whether a first genomic region (having a first determined sequence/modification status) is in cis with a second genomic region (having a second determined sequence/modification status) allows the genomic regions to be “phased,” which allows further analysis of whether these regions are interacting in synergistic or antagonistic ways. The phasing information provided by the long sequencing reads combined with modification detection greatly facilitates such analyses, allowing not only phasing of sequence-based variants, but also of base modifications within the genome, both with other base modifications and with sequence-based variants.

V. Modification Status as a QTL

Similar to quantitative trait loci (QTL) based upon polynucleotide sequence information, QTL based upon modification data can be associated with physical phenotypes including but not limited to health status, drug response, and gene expression. The methods provided herein facilitate determination of overall modification status across a chromosome or, preferably, an entire genome. For example, where a set of loci are known to be differentially modified, depending on a given phenotype of interest, screening those loci for their modification status provides a subset of modified loci, the combination of which provides a further aspect to the genotypic contribution to the phenotype. In some instances, the manifestation of the phenotype is dependent upon an absolute amount of modification present, e.g., above a certain threshold. In other instances the specific combination of loci comprising the modification(s) is a key factor. Given the differential modification observed in different cells or tissues, the analysis may further comprise a comparison of modification status between different cells, tissues, developmental stages, etc. Yet further, a systems genetics approach can be applied that combines phenotypic measurements, genotype data, modification data, and, optionally, expression data, to identify QTL associated with physical phenotypes. These QTL can then be used to identify loci that are associated with networks of genes that are functionally interrelated either because they produce proteins that are part of a common structure or because they are steps in a gene pathway. Further details and an example of this method are provided in Fang, et al. (2012) Nature Biotechnology 30(12):1232-1239, which is incorporated herein by reference in its entirety for all purposes.

In certain preferred embodiments, the ability to detect base modifications at single molecule resolution allows the quantitation of the number of molecules in a given sample that are modified at a particular locus. For a given locus, this type of mixture can be estimated for a population to generate a vector of quantitative measures over every individual in the population. This quantitative trait can then be correlated with variations in DNA (small nucleotide changes or structural variation) as a way to map genetic loci that may influence the degree to which a given site is modified (e.g., a mutation in a methylase could make it more or less effective at targeting a given site). In combination with other molecular phenotypes like gene expression or metabolite levels, the causal relationships between base modifications and their effect on traits like gene expression or higher order traits like disease can be explored and resolved.
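
To make the quantitative-trait idea concrete, the following Python sketch computes, for each individual, the fraction of sequenced molecules modified at one locus, and then correlates that vector with genotype at a candidate variant. The individuals, read calls, genotype coding (0, 1, or 2 copies of a variant allele), and the use of a plain Pearson correlation are all assumptions chosen for illustration; a real QTL analysis would use an appropriate statistical model and multiple-testing control.

```python
from math import sqrt
from typing import Dict, List


def modified_fraction(calls: List[bool]) -> float:
    """Fraction of single-molecule reads modified at the locus."""
    return sum(calls) / len(calls)


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


# Hypothetical per-molecule modification calls at one locus, per individual.
calls: Dict[str, List[bool]] = {
    "ind1": [True] * 9 + [False] * 1,
    "ind2": [True] * 5 + [False] * 5,
    "ind3": [True] * 1 + [False] * 9,
    "ind4": [True] * 5 + [False] * 5,
}
# Hypothetical genotype at a candidate variant (copies of the variant allele).
genotype: Dict[str, int] = {"ind1": 0, "ind2": 1, "ind3": 2, "ind4": 1}

individuals = sorted(calls)
trait = [modified_fraction(calls[i]) for i in individuals]
dosage = [float(genotype[i]) for i in individuals]
r = pearson(trait, dosage)
print(trait, round(r, 3))
```

In this constructed data set the modified fraction falls as the variant allele count rises, so the correlation is strongly negative, which is the kind of signal that would flag the variant for follow-up.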

VI. Complex Sample Analysis

Similar to homologous chromosome mapping, sequencing mixtures of related microorganisms can be difficult, but there is a growing interest in studying microbial metagenomics of populations of microorganisms that are living together in a specific environment. In order to identify the constituents of a given sample, researchers can do whole genome shotgun sequencing of all DNA isolated from the sample. Due to the possibility of large diversity of different organisms, second-generation sequencing methods are typically employed because of the large number of reads that are generated. However, because the reads are short, and gene sequences can often be conserved, it is often very difficult to determine if two separate reads come from the same or different organisms. For example, polynucleotide sequencing reads from samples comprising mixtures of strains or viral subtypes having similar sequences can be difficult to assign to a particular strain or subtype in the mixture. The goal is typically to assemble reads into large contigs and/or complete genomes, but use of short-read technologies makes this difficult or impossible.

Differences in epigenetic modifications between different bacterial, viral, or fungal strains/subtypes can aid in mapping a given sequence read to a particular organism. For example, bacteria typically make use of DNA adenine methylation, and this modification is important in bacterial virulence in organisms such as E. coli, Salmonella, Vibrio, Yersinia, Haemophilus, and Brucella. In Alphaproteobacteria, methylation of adenine regulates the cell cycle and couples gene transcription to DNA replication. In Gammaproteobacteria, adenine methylation provides signals for DNA replication, chromosome segregation, mismatch repair, packaging of bacteriophage, transposase activity, and regulation of gene expression. In the fungus Neurospora crassa, cytosine methylation is associated with relics of a genome defense system and silences gene expression by inhibiting transcription. Many other characteristic epigenetic modification patterns in microorganisms have been described in the literature and are known to the ordinary practitioner. The patterns of epigenetic modifications for these different purposes are distinct and characteristic of a particular organism, providing a method to distinguish one organism from another in a manner that complements polynucleotide sequence data. In this way, epigenetic modification detection during nucleic acid sequencing to produce polynucleotide sequence reads comprising both polynucleotide sequence and modification data facilitates mapping polynucleotide sequences to their source microorganism genomes. In particular, the sequence and methylation information is useful for scaffolding DNA from multiple organisms to help identify DNA fragments that come from a single organism, since DNA from a single organism will have the same methylation patterns (common methylated sequence motifs). This capability is a tremendous advantage when analyzing metagenomic samples comprising multiple different types of organisms.

In preferred embodiments, native DNA is extracted from a metagenomic (e.g., mixed) sample that contains an unknown number of different types of microorganisms. Isolated DNA is converted into sequencing libraries (e.g., shotgun sequencing libraries) and sequenced using a single-molecule, real-time sequencing technology that is capable of distinguishing modified (e.g., methylated) nucleotides from non-modified nucleotides within a template nucleic acid. In certain embodiments SMRT® sequencing is used, but other technologies can also be used, e.g., nanopore sequencing (e.g., from Oxford Nanopore Technologies, Oxford, UK). Further, although long-read technologies that provide single reads of at least 1000 kb or more are preferred, short-read sequencing technologies can also be used, e.g., alone or in combination with long-read technologies, and these include technologies from, e.g., Illumina, Life Technologies (Ion Torrent® or SOLID® systems), and Roche (454® systems). Further, circular consensus sequencing (CCS) can also be used to repeatedly sequence a given template molecule to provide a consensus sequence for that template based only on reads from a single molecule. Although the read may be long, the template sequence itself is typically much shorter and there are multiple copies of the template sequence (and/or its complement) within the read. As such, the reads can be treated as short reads even though the actual full-length read is much longer. For very complex samples with high diversity (or to identify low copy number constituents of the population) where sequencing coverage of a given position is likely to be low, CCS may enhance identification of methylation and modified motifs.

The resulting sequencing data is analyzed for the presence of modifications including determination of sequence motifs that are modified. The reads can be grouped based on similarity of the motifs that are modified. Reads that have different methylation/modification patterns are likely to be from different organisms, or at least from different chromosomes (e.g., in the case of diploid or other multiploid organisms), and reads that have the same methylation/modification patterns are identified as possibly originating from the same organism. Where short- and long-read sequencing data is used in combination, the long reads are especially useful for determining where the short reads map, which helps with the assembly of the metagenomic sample into larger contigs and with whole genome assembly.
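
The grouping of reads by modified motif can be sketched in Python as follows. The example assumes that an upstream step has already reported, for each read, the set of sequence motifs found modified in that read; the read identifiers, the motif sets, and the exact-match grouping rule are illustrative assumptions rather than a prescribed procedure.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set


def group_by_motifs(read_motifs: Dict[str, Set[str]]) -> Dict[FrozenSet[str], List[str]]:
    """Group read identifiers by the exact set of modified motifs observed."""
    groups: Dict[FrozenSet[str], List[str]] = defaultdict(list)
    for read_id in sorted(read_motifs):
        groups[frozenset(read_motifs[read_id])].append(read_id)
    return dict(groups)


# Hypothetical per-read motif calls from a mixed sample.
read_motifs = {
    "r1": {"GATC"},
    "r2": {"GATC"},
    "r3": {"GANTC"},
    "r4": {"GANTC"},
    "r5": set(),  # no modified motif detected in this read
}

groups = group_by_motifs(read_motifs)
for motifs, reads in groups.items():
    print(sorted(motifs), reads)
```

Exact set matching is the crudest possible similarity measure. Because a given read may cover only some of an organism's modified motifs, a practical implementation would more likely use a tolerance-based comparison (for example, a set-overlap score) and would treat reads with no detected motifs as unbinned.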

Sequencing is preferably carried out on native (non-amplified) DNA to ensure that the modifications are actually present during the sequencing reaction. Some sequencing methods (e.g., ensemble methods) rely on an amplification of the template prior to sequencing, but these methods effectively remove the modification prior to the sequencing reaction. In some embodiments, a combination of amplified and non-amplified template is used. For example, the modifications can be detected during sequencing of non-amplified template, and a small portion of the original template can be amplified and sequenced to provide additional coverage for only the base sequence information. This is particularly useful in situations where the starting material is limiting for library preparation; the bulk of the native DNA is sequenced for modification detection, and amplified material is used to generate high sequence coverage (without modification information).

The method is applicable for analysis of essentially any modification that is detectable, and is particularly useful for modifications that frequently occur in microorganisms, including, but not limited to, 6-mA, 4-mC, 5-mC, phosphorothioate, glucosylated hmC, etc. Further, enrichment-based sample prep can be performed to specifically isolate an organism or class of organism of interest, e.g., bacterial cells, viral particles, phage particles, or other specific sub-populations.

This method is also applicable to other types of mixed samples. For example, in further embodiments, sequencing reads that include both polynucleotide sequence read data and modification data can be analyzed to distinguish between DNA from a human and DNA from a microorganism, e.g., in a sample collected from a human patient. For example, blood can be drawn from a patient showing symptoms of sepsis or other blood-borne infection and the nucleic acids in the blood can be sequenced to provide a set of sequence reads. The origin of the sequence reads, whether human or another organism in the blood, can be determined based on the epigenomic modifications present in each sequence read, e.g., in combination with the polynucleotide sequence data itself. Once assigned to the appropriate organism, the sequences can be analyzed to both identify the microorganism and, optionally, to screen the patient's DNA for genetic markers that indicate efficacy and/or likelihood of adverse events for various drugs that can be used to treat the infection. This will allow the organism to be typed and a best treatment to be identified quickly, based on sequencing reads from a single sequencing reaction mixture, thereby reducing the time between admission of the patient and initiation of a treatment regimen. Similar methods can be used to identify other microorganisms infecting a human patient, e.g., in cultures taken from the throat, ears, nose, stomach, bladder, feet, skin lesion, etc.

Prenatal screening can also benefit from the methods described herein. Cell sorting technologies are commonly used to isolate embryonic cells in maternal blood samples, and the isolated embryonic cells can then be subjected to various screening assays to determine the presence of various genetic abnormalities. The present invention obviates the need for cell sorting and instead sequences both maternal and embryonic nucleic acids together in a single sequencing reaction mixture to generate polynucleotide sequencing reads comprising both polynucleotide sequence data and modification data. Since non-CpG methylation is prevalent in embryonic stem cells but not in differentiated cells, the detection of this epigenetic modification within a sequencing read is indicative that the sequencing read corresponds to embryonic genetic material, thereby facilitating distinction between maternal and embryonic DNA. Similarly, stem cells in blood samples also have different methylation patterns than differentiated cells, so sequencing of blood samples that includes modification detection can be used to distinguish stem cell-derived nucleic acids from other nucleic acids in the blood of an individual.
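
As a minimal sketch of how non-CpG methylation might be flagged within a single read, the Python example below reports methylated cytosines that are not followed by a guanine. The read sequence, the 0-based methylation positions, and the function name are hypothetical, and the example ignores strand, call confidence, and the noise inherent in single-site calls, all of which a practical classifier would need to address.

```python
from typing import Iterable, List


def non_cpg_methylated_positions(seq: str, methylated: Iterable[int]) -> List[int]:
    """Return methylated cytosine positions (0-based) that are not in a CpG context.

    A cytosine at the last position of the read has no known 3' neighbour and
    is skipped, since its context cannot be determined from the read alone.
    """
    seq = seq.upper()
    found = []
    for pos in sorted(methylated):
        if seq[pos] != "C":
            continue  # ignore calls that do not fall on a cytosine
        if pos + 1 < len(seq) and seq[pos + 1] != "G":
            found.append(pos)
    return found


read = "ACGTCATCG"  # cytosines at 1 (CpG), 4 (CpA), and 7 (CpG)

print(non_cpg_methylated_positions(read, {1, 4}))
print(non_cpg_methylated_positions(read, {1, 7}))
```

A read for which this function returns one or more positions would be a candidate for assignment to embryonic material under the reasoning above, whereas a read with only CpG methylation would not be distinguished on this basis.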

In specific embodiments, the methods herein provide a method ofdiagnosing an individual (e.g., an unborn fetus) at risk of having animprinting-dependent disease or other condition. For example,Prader-Willi syndrome (PWS) is caused by the deletion of the paternalcopies of the imprinted SNRPN and NDN genes along with clusters ofsnoRNAs including SNORD64, SNORD107, SNORD108, SNORD109, and SNORD116(HBII-85), which are on chromosome 15 in the region 15q11-13. Due toimprinting, the maternally inherited copies of these genes are silencedand only the paternal copies are expressed, so an individual with apaternal chromosome with this region deleted will exhibit PWS.Similarly, Angelman syndrome (AS) is caused by lack of expression ofanother gene in this region, UBE3A. This gene is found in the region15q12 and while both the maternal and paternal copies are active in manyof the body's tissues, UBE3A is normally expressed only from thematernal chromosome in certain areas of the brain due to imprinting ofthe copy in the paternal chromosome. If the maternal copy of the UBE3Agene is lost because of a chromosomal change or a gene mutation, aperson will have no active copies of the gene in some parts of thebrain. PWS and AS are the first reported instances of imprintingdisorders in humans. Both PWS and AS can be caused by several differentkinds of genetic phenomena, including deletion, uniparental disomy, ormutations that cause inactivation of the relevant gene(s). As describedabove, the sequencing methods herein can simultaneously produce sequencereads for a plurality of different nucleic acid molecules in a singlesample. As such, a maternal blood sample can be used to sequence nucleicacids from both maternal and fetal cells in the mother's blood. 
The maternal sequence can be used to identify which sequences in the fetal nucleic acids are maternal in origin, and analysis of the 15q11-13 region for both homologs of the fetus will provide a diagnosis of whether the fetus is affected with either PWS or AS. For example, if the fetal nucleic acids have a deletion of the maternal copy of UBE3A, a diagnosis of AS is made; and if the fetal nucleic acids have a deletion of the paternal copies of the 15q11-13 region, a diagnosis of PWS is made. While there are various known methods for diagnosing these disorders, there is no single method that can both diagnose and distinguish between the various subtypes. The present invention provides such a method. This information can be useful for both treating the affected individual, as well as discerning a risk that a sibling will also be affected. Similarly, the prenatal diagnosis of other diseases, disorders, or other phenotypes due to imprinting and/or other parent-of-origin-dependent expression patterns is also contemplated.

In related embodiments, the methods herein provide a simplified method for detecting and sequencing cancer cells in a tissue sample from an individual. Aberrant DNA methylation patterns have been associated with a large number of human malignancies and found in two distinct forms: hypermethylation and hypomethylation compared to normal tissue. For example, hypermethylation typically occurs at CpG islands in promoter regions, is associated with gene inactivation, and is one of the major epigenetic modifications that repress transcription of tumor suppressor genes. Hypomethylation is associated with chromosomal instability. Further, previous studies have shown that tumors form distinct subtypes based on the degree of CpG-island methylation. (See, e.g., Toyota et al. (1999) Proc. Natl. Acad. Sci. USA 96:8681-8686, and Weisenberger et al. (2006) Nature Genetics 38(7):787-793, incorporated herein by reference in their entireties for all purposes.) Identification of such aberrant methylation patterns in a sequencing read from a biological sample is indicative that the source of the sequencing read is a cancerous cell. Yet further, the patterns of methylation can be used to identify a source of the cancer cell in the body of an individual due to the known methylation patterns that characterize different tissues of the body and act to maintain the differentiation of different cell types in those tissues.
For example, if a blood sample is sequenced to reveal polynucleotide sequence reads having aberrant methylation patterns corresponding to cancer, DNA regions outside of the aberrant methylation patterns can be analyzed to determine where in the body the cancer cell originated based upon the differentiation-based methylation patterns found in normal differentiated cells from different tissues in the body. As such, the modifications within a polynucleotide sequence read can not only identify the read as having come from a cancer cell, but can also identify where in the body the cell originated because of the maintenance of tissue-specific epigenetic modifications. It is expected that this will allow much faster identification of a primary tumor in the body, which can lead to quicker initiation of treatment and most likely a better outcome, e.g., higher cure/survival rate.
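As a rough illustration of the tissue-of-origin comparison just described, the following sketch matches the methylation levels observed in a read (outside the aberrant regions) against reference profiles for normal tissues. The region identifiers, the reference values, and the use of a mean absolute difference as the distance are all assumptions made for this example; the text does not prescribe a particular comparison metric.

```python
def closest_tissue(observed, references):
    """Return (tissue, distance) for the reference profile nearest to `observed`.

    observed:   dict mapping region id to observed methylation fraction (0..1)
    references: dict mapping tissue name to a dict of region id to fraction
    The distance is the mean absolute difference over regions present in both
    (an assumed metric chosen only for illustration).
    """
    distances = {}
    for tissue, profile in references.items():
        shared = observed.keys() & profile.keys()
        if shared:
            total = sum(abs(observed[r] - profile[r]) for r in shared)
            distances[tissue] = total / len(shared)
    if not distances:
        return None  # no region overlaps any reference profile
    best = min(distances, key=distances.get)
    return best, distances[best]


# Hypothetical reference profiles and one observed read-derived profile.
references = {
    "colon": {"region_1": 0.9, "region_2": 0.1},
    "liver": {"region_1": 0.2, "region_2": 0.8},
}
observed = {"region_1": 0.85, "region_2": 0.2}
tissue, distance = closest_tissue(observed, references)
```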

Due to the rarity of certain cell types in some bodily fluids or tissues, in certain embodiments enrichment procedures are used prior to sequencing to increase the number of cells of interest in a given sample, some of which have been discussed supra and many of which are well known and routinely used in the art. For example, antibodies to surface proteins specific to a certain type of cell, e.g., cancer cell, can be used to enrich a mixture for the cells of interest. This enrichment need not be complete, since the polynucleotide sequence reads of the cells of interest can be distinguished from those of “contaminating” cells as described herein. Even where highly efficient cell sorting technology is used, the methods herein provide a further assurance that the polynucleotide sequence data being attributed to a cell of interest actually corresponds to that cell by validating the identity of the source cell through analysis of epigenetic modifications within the polynucleotide sequence reads.

In yet further aspects, methods herein facilitate determination of modifications within a forensic sample, thereby providing a further characteristic that can be used to identify the source, authenticate a sample, separate nucleic acids in a sample that potentially has multiple sources, etc. For example, detection of methylated bases in a forensic sample can confirm that the sample is biological in origin, e.g., that it has not been amplified. A sample that has been amplified did not come directly from biological material, and may be proof of contamination and/or evidence tampering. In certain embodiments, modifications caused by damage to nucleic acids, e.g., from exposure to UV, oxygen, etc., are identified and used in the analysis. For example, a dried blood sample is expected to contain damaged bases consistent with environmental exposure, so DNA-damage-related modifications within sequencing reads from the sample can provide further confirmation that the sequence reads came from the dried blood sample and not some other contaminating source. Further, the amount of damage can be indicative of how long the sample was exposed. In the field of forensic science it is desirable to be able to distinguish whether a DNA sample taken from a scene in the field derives from biological tissue or was synthetic DNA placed at the scene to confuse analysis or attempt to produce an incorrect result. The present methods can be applied to simultaneously extract sequence information as well as establish whether the DNA contains methylation patterns consistent with the supposed biological source.

VII. Expression Analysis Based on Modification Status

Cellular differentiation in eukaryotes involves epigenetic changes that alter gene expression in cells of different tissue types. During morphogenesis, totipotent stem cells become the various pluripotent cell lines of the embryo, which in turn become fully differentiated cells, e.g., neuronal cells, muscle cells, epithelium, endothelium, etc. This occurs through activation of some genes and inhibition of other genes, as well as certain structural changes in the DNA, to produce all the different cell types of an organism. The most common epigenetic change is addition of methyl groups to DNA, mostly at CpG sites, to convert cytosine to 5-methylcytosine. Approximately 60-90% of CpGs are methylated in the human genome. Some areas of the genome are more heavily methylated than others, and highly methylated areas tend to be less transcriptionally active. Further, the particular areas that are heavily methylated in one tissue or cell type are different from the genomic regions that are heavily methylated in other tissue or cell types, and these differing patterns of methylation are directly related to differing patterns of expression of genes. As such, detection of methylation levels in various regions of the genome, e.g., within promoter, exonic, or intronic regions, is indicative of the activity of genes within those regions. By performing sequencing reactions that include modification data such as methylation status, secondary structure, and the presence of bound proteins or other factors, the present invention provides a method for determining expression levels that is not dependent upon isolation and sequencing of mRNA transcripts.

For example, histone modifications are known to be associated with gene expression, so determining the location of histones and how tightly they are bound to the DNA is indicative of expression of a gene in the vicinity. For example, histones are modified by methylation and/or acetylation. While methylation can either strengthen or weaken binding of histones, depending on the location of the modification, acetylation has generally been found to change the binding of histones such that transcriptional activity is increased. Likewise, deacetylation has been found to decrease transcription. Since the affinity of the histone has a direct effect on the availability of the gene for transcription, mapping histones and the strength of their binding within a DNA region is informative as to the transcriptional activity of the region. Binding of histones to DNA can be detected through the same kinetics-based detection methods as are used for other types of modifications. The sequencing process slows when a histone is encountered and resumes only when the histone has been displaced, so a histone that is bound more strongly is expected to take longer to displace. As such, the pause is characteristic of the modification status of the histones encountered, which is in turn characteristic of the transcriptional activity of the region.

The pattern and strength of histone association with a DNA region is indicative of the transcriptional activity of that DNA region. In preferred embodiments, modifying agents are used to mark the DNA and allow footprinting of the histone proteins (as well as other bound agents). In certain preferred embodiments, the DNA is marked or tagged with a modifying agent in a manner that is specific for regions in which histones are absent, e.g., since the regions wrapped around histones are unavailable for marking. For example, the genomic DNA can be subjected to modification with a methyltransferase, glucosyltransferase, or an enzyme that introduces phosphorothioate modifications. Preferably, the modifying agents introduce modifications that are absent from the native DNA so that the introduced modifications are distinguishable from any native modifications already present in the DNA sample. For example, a modification that is present in bacteria but absent from human DNA can be used to footprint histone proteins in a human DNA sample. Subsequently, the DNA is treated to remove only those histones that have a lower affinity for the DNA prior to sequencing, thereby only removing histones in regions that are transcriptionally active. The added modifications are detectable during sequencing, and three types of regions are identified in the resulting sequence reads. First, the regions that comprise the added modifications are regions that were not bound by any histones. Second, the regions that lack the added modifications are identified as regions previously associated with histones in transcriptionally active areas. Third, the regions that displayed kinetics consistent with a bound histone are identified as regions in transcriptionally inactive areas. These methods can be used in the absence of, or alternatively in combination with, mRNA sequencing to analyze gene expression levels in different cells and tissues.
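The three region types identified in this footprinting scheme can be summarized as a simple decision rule. The sketch below assumes that upstream analysis has already reduced each region to two yes/no observations (whether the added modification was detected, and whether a pause consistent with a bound histone was seen); the function name and the label strings are hypothetical.

```python
def classify_region(has_added_modification, histone_pause_detected):
    """Assign a footprinted region to one of the three classes in the text."""
    if histone_pause_detected:
        # Kinetics consistent with a histone still bound during sequencing.
        return "histone bound: transcriptionally inactive area"
    if has_added_modification:
        # The modifying agent had access, so no histone was present.
        return "not bound by histones"
    # No added modification and no pause: a lower-affinity histone blocked
    # the modifying agent and was then removed before sequencing.
    return "previously histone-associated: transcriptionally active area"
```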

In certain aspects, sequencing methods that provide kinetic data indicative of modifications within a nucleic acid template are used to screen cell lines. For example, the status of a pluripotent cell line preparation can be assayed to determine what portion, if any, of its constituent cells are pluripotent cells. Using standard methods in the art, pluripotent cell lines can be created by contacting differentiated cell lines with one or more reprogramming agents that contribute to reprogramming of the differentiated cells to a pluripotent state. The cell line is grown in the presence of the reprogramming agent(s) for a period of time sufficient to begin reprogramming of said cell line. The progress of the cell line toward the pluripotent state can be monitored by taking periodic samples, extracting the nucleic acids, and sequencing them using a method that provides kinetic data indicative of the modification status of the nucleic acids. Modifications characteristic of pluripotent cells (e.g., non-CpG methylation) are identified and used to determine the prevalence of pluripotent cells in the cell culture. For example, a prevalence of non-CpG methylation in the cell line is indicative that the differentiated cell line has been reprogrammed into a pluripotent cell line.

VIII. Other Related Applications

Second generation sequencing technologies (e.g., SOLiD® and Solexa® sequencing) have read lengths that are too short to effectively sequence long, highly repetitive genomic regions. Such regions include the critically important centromeric and telomeric regions, which are often characterized as heterochromatic regions, tightly packaged areas in the genome that tend to be less transcriptionally active than the less tightly packaged euchromatic regions. Other highly repetitive regions include those containing microsatellites or other regions having variable copy numbers of short-repeat sequences. Heterochromatic regions also tend to be highly modified, and second generation sequencing technologies that do not provide kinetic reaction data are not generally able to identify such modifications based upon a single sequencing read. In contrast, sequencing technologies that provide long read lengths (e.g., at least about 500, 1000, 1500, 2000, 3000, 5000, or 10,000 bases or more) are able to extend deeply into these regions, preferably spanning them completely to provide not only sequence and kinetic data, but also copy number. Further, where single-molecule kinetic reaction data is also provided, the modifications within these regions can also be fully characterized within a single sequencing read, e.g., without the need to treat the genomic DNA with harsh, DNA-damaging chemicals such as bisulfite. This ability to sequence through a highly repetitive region and, simultaneously, detect modifications therein is of particular importance in the study of diseases and disorders caused by changes in repeat regions within a genome. There are many repetitive regions within the human genome that are associated with such diseases/disorders, which include, e.g., myotonic dystrophy, Huntington's disease, amyotrophic lateral sclerosis-frontal temporal dementia (ALS-FTLD), long-QT syndrome, spinocerebellar ataxias, fragile X syndrome, and the like.
Determining not only how many repeats are present, but also any modifications within those repeats can provide much-needed data for development of diagnostics, prognostics, and theranostics for patients.

In other embodiments, as noted above, heterochromatic DNA can be treated to footprint agents bound to the chromatin, e.g., histone proteins, RNA factors, regulatory proteins, etc. During a subsequent sequencing reaction, long sequencing reads are generated, and these sequencing reads comprise both polynucleotide sequence information and kinetic information, and the structure of the genomic DNA is analyzed. Although the footprinting assay described above with regard to expression analysis involves removal of only a portion of the histones, i.e., those that are less tightly bound, footprinting assays can also be performed that remove all bound agents prior to sequencing. The patterns of the introduced modifications present in the resulting sequencing reads are therefore indicative of all positions at which the modifying agent was blocked, which are indicative of an agent bound to the genomic nucleic acid sample during the modification reaction.

In related embodiments, regions of euchromatin and heterochromatin can be distinguished based upon the density of modifications introduced. Since heterochromatin is more tightly structured, fewer bases are accessible to a modification enzyme, but euchromatin tends to be more loosely structured with a higher percentage of bases available for modification. As such, treating a chromosome with a modification enzyme, removing the nucleic acid from the chromatin, and sequencing the nucleic acid allows identification of blocks of sequence that are putative heterochromatin (low density of modifications) and putative euchromatin (high density of modifications). This same method will also identify regions in the euchromatin where agents are bound, since there will be a “footprint” that lacks modification at loci that are blocked, e.g., by a bound protein. Yet further, chromatin can be subjected to specific treatments that remove only part of the higher-order structure prior to modification, and the patterns of modification that are produced after different types of treatments are indicative of the locations of the various different constituents of the higher-order chromatin structure.
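The density-based distinction can be sketched as a windowed count of introduced modifications along a sequenced molecule. The window size, the density threshold, and the function name below are illustrative assumptions; suitable values would depend on the modification enzyme and the treatment conditions used.

```python
def label_windows(length, modified_positions, window=1000, threshold=0.05):
    """Label fixed-size windows by the density of introduced modifications.

    length:             number of bases in the sequenced molecule
    modified_positions: iterable of 0-based positions carrying the added mark
    Returns a list of (start, end, label) tuples with end exclusive.
    """
    positions = set(modified_positions)
    labels = []
    for start in range(0, length, window):
        end = min(start + window, length)
        span = range(start, end)
        count = sum(1 for p in positions if p in span)
        density = count / (end - start)
        if density >= threshold:
            label = "putative euchromatin"      # high density of modifications
        else:
            label = "putative heterochromatin"  # low density of modifications
        labels.append((start, end, label))
    return labels
```

Within a window labeled as putative euchromatin, a short stretch lacking modifications would correspond to the "footprint" of a bound agent; detecting such stretches would need a finer window than the one used for the chromatin-state call.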

Further, analysis of methylation patterns in pathogenic organisms can provide insight into the biological basis for their pathogenicity. This insight can be used to determine appropriate treatments, or to develop new treatments for combating individual infections and widespread outbreaks. A recent study has implicated the activity of a methyltransferase in the pathogenicity of an E. coli strain responsible for an outbreak in Germany, and this finding is indicative that methyltransferases, and potentially other modifying agents, may be good targets for drug development efforts to combat outbreaks. More information on the role of long read length combined with modification detection during single-molecule real-time sequencing is provided in the article entitled, “Origins of the E. coli Strain Causing an Outbreak of Hemolytic-Uremic Syndrome in Germany” (Rasko, et al., New Engl. J. Med. 2011; 365:709-71), which is incorporated herein by reference in its entirety for all purposes.

Yet further aspects of the invention are directed to hemimodification analysis. When modified DNA is replicated in a cell, the daughter duplexes have one strand from the original DNA and the other strand is “nascent” or newly synthesized based on the sequence of the parental strand. When the DNA to be replicated has modified bases (e.g., methylated bases), the resulting duplex is “hemi-modified,” meaning that only the parental strand has the modification. As such, the nascent strand must be modified after synthesis to add these modified bases, thereby providing a daughter duplex having the same modifications as the original parental duplex. In vivo, specific enzymes (e.g., maintenance methyltransferases) are responsible for identifying hemi-modified nucleic acids and restoring the original modification complement of the parental duplex molecule. Maintenance of the modification status of the genome of an organism is important for the control of transcriptional activity, as well as distinction between “self” and “non-self” nucleic acids, e.g., for pathogen identification. Given that hemi-modified nucleic acids are important in cellular mechanisms involving cell division, DNA repair, and self-recognition, the ability to identify these transient molecules in a genomic DNA sample will allow identification of actively dividing cells, and provide insight into biological outcomes when such modification status is not properly maintained, e.g., in cancer development or pathogenic infection. Single-molecule, real-time sequencing methodologies provide the data necessary to not only map a sequencing read to a particular genomic region, but also to determine the modification status of the genomic nucleic acid itself.
In particularly preferred embodiments, both the modified and unmodified strands are sequenced in a single template molecule such that a single resulting sequencing read has both base sequence data and kinetic data that is indicative of the modification status of both strands of the template. Preferred template molecules are provided, e.g., in U.S. Patent Publication No. 2009/0298075, which is incorporated herein by reference in its entirety for all purposes.

IX. Screening Methods

DNA methylation patterns are known to be established and modified in response to environmental factors by a complex interplay of multiple different DNA methyltransferases, e.g., DNMT1, DNMT3A, and DNMT3B. (See, e.g., Li, et al. (1992) Cell 69(6):915-926.) Knowledge of the patterns that are indicative of various environmental exposures in combination with sequencing methods that generate both polynucleotide sequence data and modification data provides a screening method for determining whether an individual has been exposed to an environmental trigger, e.g., toxin, teratogen, carcinogen, radiation, malnutrition, pathogen, infection, etc. In certain embodiments, the screening comprises detecting DNA damage-related modifications, e.g., 8-oxoG, thymidine dimers, and crosslinking, which are indicative of an exposure to a chemical or radiation that induces such damage.

In other embodiments, the screening comprises detecting modifications related to activation of an immune response. Such an activation can be indicative of an exposure to a pathogen or other infectious agent, or can be a response to an injury. In yet further embodiments, the screening comprises detecting modifications related to changes in the endocrine system, which can be indicative of exposure to environmental toxins, such as endocrine disruptors. The effects of exposure to endocrine disruptors are of particular interest because such exposure can cause cancerous tumors, birth defects, and other developmental disorders. Changes in the endocrine system can also be the result of natural changes that occur during aging, e.g., puberty- or menopause-related changes. Given the interest in studying and managing both early-puberty and early-menopause, these methods provide a strategy for identification of such changes even before they are outwardly visible in an individual. Further, the specific endocrine changes taking place may be indicative of the cause of the physical manifestation. For example, the particular changes in modification patterns in an individual experiencing precocious puberty could help to determine if the early onset of puberty is due to exposure to a particular environmental toxin or not.

Yet further, it is well known that underlying a phenotypic trait are both genetic and environmental factors. Detection and mapping of epigenetic modifications provides a strategy for teasing apart the contributions of each by monitoring environmentally-induced epigenetic changes and determining the effect of those epigenetic changes on the phenotype of an individual. Such changes may be found in the DNA itself, or may be found in RNAs, which can change the translation of the RNAs into proteins, thereby impacting phenotype of an organism.

Further, where a treatment regimen involves changing expression patterns in an individual's genome, modification sequencing can be used as a tool to determine whether a treatment is working. For example, where a drug is intended to increase expression of a gene or gene pathway, and that increased expression is due to a known change in methylation of the gene(s), nucleic acids from the tissue where the change is expected can be sequenced prior to treatment (control) and during treatment to determine if the treatment is having the intended effect (is efficacious). Further, the individual could be screened periodically thereafter to screen for development of drug/treatment tolerance that might indicate a need for an increased dose or an alternative treatment.

X. Genome Tagging

While many of the methods described herein involve detection of modifications already present in a nucleic acid sample from a biological source, in certain embodiments modifications can be added to a nucleic acid sample, e.g., to “tag” or “barcode” it so it can be identified in a later step of an analysis by virtue of the presence of the modifications. In some preferred embodiments, barcodes are used to tag nucleic acids from different sources so they can be pooled, sequenced in a single reaction volume, and the sequences linked back to the source by virtue of the barcode sequences within the sequence reads. In certain preferred embodiments, the modifications introduced to tag or barcode a sample are not native to the sample. While modification-based barcodes are discussed at length herein, it will be understood that other types of barcodes can also be used to tag nucleic acids in the methods described herein, e.g., barcodes lacking modifications but having unique sequence identifiers.

In certain embodiments, nucleic acids from two different sources are sequenced in a single nucleic acid sequencing reaction mixture. To facilitate distinction of the sequence reads from one source from the sequence reads from another source, one of the nucleic acid samples can be altered by adding a modification that is absent from the other nucleic acid sample. The two samples can then be combined and sequenced together. The resulting sequencing reads comprise both sequence data and modification data, the latter of which is used to assign a given sequencing read to a particular source. In one such embodiment, a modification is added to DNA from a first twin, but not to DNA from the second twin. The two DNAs are sequenced together and any differences found can be mapped to either the first or second twin based upon the presence or absence of the modification, respectively. In certain embodiments, the sequencing is performed iteratively with a first iteration having the first twin's DNA modified, and the second iteration having the second twin's DNA modified. This allows interrogation of not only sequence data, but also modification data that is present in the native form of the unmodified DNA. In certain embodiments, the modification added is not naturally present in the DNA of the twins' species.

In some embodiments DNA modification information can be used to tag different portions within a single genome or identify locations within a set of genomes. For example, in the field of genome assembly, it is often useful to have, for a particular sequence, approximate information about its location of origin within a genome. Methods such as optical mapping can provide this information but are expensive and require extra processing steps. Methods to produce systematic patterns of DNA modification can help provide this approximate location information. In this role it is analogous to the use of endogenous methylation to assist with assembly, but without the dependency on an endogenous source of diversity in modification with position. One method of producing such a position-dependent modification pattern is by cultivating synchronized cell populations in medium containing modified nucleotide substrates. These modified nucleotides can comprise chemical groups that are found in the organism naturally, such as methyl groups on adenosine, or modifications that are found in living organisms but are not native to the organism in question, or can represent entirely non-natural modifications. Means such as electroporation may be needed to ensure that cells take up the nucleotides, but once the modified nucleotides are available to the genome synthesis process, the DNA will take up modified bases in proportions that reflect the mixture that is applied. By changing the mixture over the course of genome synthesis, different “tags” can be applied to different locations within the genome. The tags can be binary (the presence or absence of a particular modified base indicating a particular range of locations) or continuous (where the proportion of modified nucleotides in the DNA encodes a continuous variable such as position).
In a preferred embodiment, a continuous gradient is applied beginning with the origin of replication at one level of modification and continuously changing by adjustment of the proportion of nucleotides in the growth medium to a different level at the end of genome synthesis (in the synchronized population). In this embodiment, an estimator of modification density would be computed over a window of fixed or variable size and the modification density can then provide information to confirm or refute a potential genome assembly structure. In another preferred embodiment, the level of modification would be changed periodically over two or more cycles during genome synthesis and would provide finer-grained resolution on position at the expense of ambiguity as to which of the cycles is represented. This would be better suited to repeats that have a structure that is small compared with the genome size, while the former method would be better with repeats that have a characteristic size scale that is comparable with the genome size. When repeats have characteristic size scales that are both large and small, both methods can be applied simultaneously using different modifications. In these methods, the type of modification would be inferred either from the base that is modified or from the kinetic signature that is observed, or a combination of these features. After binning by modification type, the window calculation would be performed as before, and the results from the two methods would then be used in tandem to provide a location that is both precise and unambiguous. This method can be extended to as many nucleotide types as needed, limited only by counting statistics of how many modified bases need to be counted to produce a reliable estimator, and how many of those bases are contained within a particular window size.
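The combined use of a continuous gradient and a periodic pattern can be sketched numerically. The example below assumes a linear gradient between two known modification levels and a sawtooth periodic pattern whose phase within a cycle has already been estimated from a second modification type; both shapes, and all function names, are assumptions made for illustration, since the text does not specify the functional form of either pattern.

```python
def window_density(modified_count, eligible_bases):
    """Estimate modification density within one window of a read."""
    if eligible_bases == 0:
        raise ValueError("window contains no bases eligible for modification")
    return modified_count / eligible_bases


def coarse_position(density, start_level, end_level):
    """Map a density to a fractional genome position (0 = origin, 1 = end).

    Assumes the modification level changed linearly from start_level to
    end_level over the course of genome synthesis.
    """
    fraction = (density - start_level) / (end_level - start_level)
    return min(1.0, max(0.0, fraction))


def combined_position(coarse, fine_phase, cycles):
    """Resolve the cycle ambiguity of the periodic tag using the gradient tag.

    fine_phase: position within one cycle (0..1) inferred from the periodic
                modification, assumed to follow a sawtooth pattern
    cycles:     number of cycles applied over genome synthesis
    Returns the candidate position closest to the coarse estimate.
    """
    candidates = [(k + fine_phase) / cycles for k in range(cycles)]
    return min(candidates, key=lambda c: abs(c - coarse))
```

The coarse estimate is imprecise but unambiguous, while the periodic estimate is precise but ambiguous as to cycle; choosing the periodic candidate nearest the coarse estimate is one simple way to use the two in tandem.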

Another method of producing position-dependent modification is by restriction digestion of fragments followed by size separation on a gel. In this method, restriction endonucleases that cut infrequently (compared with the read length of the sequencing system) would be desirable in order to produce information that is useful above and beyond what normal assembly algorithms can achieve. For example, with a 30,000 base read length, it would be desirable to produce fragment sizes that range from 60,000 bases up to several hundred thousand bases. These can then be size-fractionated on a gel, cut into small bins, and chemically modified by any of a number of methods (for example, through the use of different doses of the nonspecific methyltransferase mentioned elsewhere in this document). These fractions can then be pooled and sequenced. During sequencing, the degree of observed modification in the DNA would then provide an upper bound on how far it is to the nearest restriction recognition sequence, and this information would be useful in resolving which of several possible assemblies is the correct assembly.
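One way to picture the lookup from observed modification level to a distance bound is shown below. The sketch assumes each gel bin was treated with a distinct dose that yields a known expected modified fraction, and that each bin has a known maximum fragment length; the bin values and the function name are hypothetical.

```python
def fragment_size_bound(observed_fraction, bins):
    """Return the maximum fragment length of the best-matching gel bin.

    bins: list of (expected_modified_fraction, max_fragment_length_in_bases)
    The read is assigned to the bin whose expected fraction is closest to the
    observed one. Because the read lies within a fragment no longer than that
    bin's maximum, the returned length bounds the distance from the read to
    the nearest restriction recognition sequence.
    """
    expected, max_length = min(bins, key=lambda b: abs(b[0] - observed_fraction))
    return max_length


# Hypothetical dose calibration for three size fractions.
bins = [(0.1, 100_000), (0.3, 200_000), (0.5, 400_000)]
bound = fragment_size_bound(0.27, bins)
```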

Another method of producing position-dependent modification is by using cloning methods. If a BAC library or fosmid library is constructed from the genome of interest, sampled into pools with less than 1× coverage of the genome and then methylated to different degrees, this will allow identification of the pool from which the molecule is drawn. This information can then be used to confirm or refute hypotheses derived from sequence assembly as to the correct structure of the genome.

Another method of producing position-dependent modification in the DNA would be to physically elongate molecules having a common terminus and then subject the molecules to a spatially varying dose of a modifying agent or a cofactor or substrate of the modifying agent. Methods for elongating molecules are known in the art, e.g., Turner, et al. (2002) Physical Review Letters 88: 128103-1-128103-4; Cabodi, et al. (2002) Analytical Chemistry 74: 5169-5174; Tegenfeldt, et al. (2004) Anal Bioanal Chem 378: 1678-92; Persson, et al. (2010) Chemical Society Reviews 39: 985-999; and Hastie, et al. (2013) PLoS ONE 8(2): e55864, all of which are incorporated herein by reference in their entireties for all purposes. Another method of elongating the molecules is to attach them to a solid support and apply a perpendicular force using electric fields or gradient magnetic forces on magnetic particles attached to the DNA molecules. Then the gradient of modification can be applied by using diffusion to create a gradient of agent dose or other means known in the art for creating a vertical gradient.

In some embodiments, gene expression can be determined by growing cells in a medium comprising modified rNTPs. The mRNAs from the cells are collected and sequenced to determine which comprise the modified rNTPs, and were therefore transcribed during the growth in the medium comprising the modified rNTPs. The cells can be treated by different environmental stimuli or agents to assess how these changes in growth conditions impact transcription. The method is especially helpful in distinguishing between mRNAs that have a long lifetime in a cell versus those that are being expressed at a high level because the former will only have a small number of transcripts with the modified rNTPs even if there is a large number of total transcripts present in the cell. In contrast, the latter will have all or nearly all transcripts comprising the modified rNTPs.
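The distinction between long-lived and highly expressed mRNAs reduces to the fraction of a gene's transcripts that carry the modified rNTPs. The sketch below assumes per-gene counts of labeled and total transcripts are already available; the 0.8 cutoff, the gene names, and the label strings are hypothetical.

```python
def interpret_transcripts(counts, high_fraction=0.8):
    """Classify abundant transcripts by the fraction carrying modified rNTPs.

    counts: dict mapping gene name to (labeled_transcripts, total_transcripts)
    Genes with no observed transcripts are skipped.
    """
    result = {}
    for gene, (labeled, total) in counts.items():
        if total == 0:
            continue
        fraction = labeled / total
        if fraction >= high_fraction:
            # All or nearly all transcripts were made during the labeling period.
            result[gene] = "highly expressed during labeling"
        else:
            # Many transcripts predate the labeling period.
            result[gene] = "mostly pre-existing (long-lived) transcripts"
    return result


# Hypothetical counts: both genes are equally abundant in total.
example = interpret_transcripts({"geneA": (950, 1000), "geneB": (40, 1000)})
```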

Similarly, a culture comprising multiple different microorganisms can be analyzed to determine which are actively growing, e.g., under differing conditions, by adding modified dNTPs to the culture. The DNA from the microorganisms is collected and sequenced, and the organisms that incorporated the modified dNTPs into their DNA are identified as those that were actively growing in the culture. As such, sequencing data that comprises both polynucleotide sequence data and modification data can not only identify the organisms based on similarity to known reference sequences, but can also identify which are actively growing in a mixed culture.

In certain aspects, the invention provides methods for determining sequence and modification differences between nucleic acids from different tissues that involve incorporation of nucleic acids from both sources into a single template nucleic acid molecule. Such methods typically involve treating the nucleic acids from a first source with a modifying agent as described elsewhere herein, and hybridizing nucleic acids from both sources together to produce double-stranded hybrid molecules comprising one nucleic acid strand from the first source and one nucleic acid strand from the second source. Hybrid molecules are selected from the mixture, which also comprises duplexes that are not hybrids, e.g., that have two strands from the same source. Such a selection is typically based upon secondary structures that form due to mismatches between the two strands of the hybrid molecules. For example, where one source is tumor tissue and the other source is normal tissue, it is expected that large-scale mutations (large insertions, deletions, translocations, inversions, etc.) in the tumor-derived nucleic acid will cause a looping out of the nucleic acid from the normal tissue due to a lack of a complementary sequence in the tumor-derived strand. As such, the selection can be based upon binding of single-strand-specific agents to the hybrid molecules and the capture of these single-strand binding agents, e.g., through affinity binding methods, bead capture methods, and the like. The nucleic acids not captured are removed and the hybrid molecules are released into solution. The hybrid molecules can be further manipulated to prepare them for sequencing, e.g., by addition of adaptors, preferably stem-loop adaptors, and this can be performed either before or after the selection step. 
The hybrid molecule is used as a template in a sequencing reaction in which both polynucleotide sequence data and modification data are collected, and this data is analyzed both to provide a sequence for each strand of the hybrid molecule and to detect the modifications within these strands. The portions of the polynucleotide sequence read(s) collected during the sequencing that comprise modifications added by the modifying agent are identified as being from the first source, and the portions of the polynucleotide sequence read(s) collected during the sequencing that do not comprise modifications added by the modifying agent are identified as being from the second source. The differences between the polynucleotide sequence data from the first and second source are further analyzed based upon knowledge about the sources. For example, where the first source is tumor tissue and the second source is normal tissue, differences between the polynucleotide sequence data are likely indicative of mutations in the tumor-derived nucleic acid that are related to the tumor phenotype. In certain embodiments, the mismatches in the hybrid molecules are repaired using nucleotides having distinguishable modifications, such that they are identifiable from the modification data collected from the sequencing reaction. The loci at which they are detected are indicative of a previous mismatch between the nucleic acids from the two sources, although the nature of the mismatch may not be discernible.

XI. Method for Identification of Novel Methyltransferases in Bacterial Strains

Identifying novel methylation enzymes is challenging, but a method is provided herein for rapidly identifying such enzymes and/or bacterial strains producing such enzymes. The method uses single-molecule, real-time sequencing (e.g., SMRT® Sequencing from Pacific Biosciences, Menlo Park, Calif.) to detect methylated nucleotides within a plasmid template molecule. The sequencing data generated can also be analyzed to determine sequence specificity of the enzyme, as well as its putative restriction enzyme partner, such as in the case of prokaryotic restriction-modification (RM) systems.

The sequence of the AllMer plasmid has all possible combinations of multi-base motifs, e.g., preferably four-base, five-base, and/or six-base motifs. The region of the plasmid comprising the multi-base motifs is preferably about 1000 base pairs in length, but can be longer, and can include multiple copies of one, some, or all of the multi-base motifs. The plasmid also comprises a vector region that comprises all the necessary elements for plasmid propagation in a host organism, e.g., a bacterial strain. The vector region is typically about 2-3 kb in length. The plasmid also comprises a restriction site, preferably within the vector region. Cleavage at the restriction site provides a double-stranded linear molecule having a blunt end or a 3′ or 5′ overhang, depending on the type of restriction site. Stem-loop adaptors are ligated to the ends of the linear molecule to produce a sequencing template, termed a “SMRTbell™ template,” that is a topologically closed molecule that forms a single-stranded circular template when the double-stranded portion is separated (since at each end the termini are connected via a stem-loop adaptor). The sequencing template is typically 3-6 kb in length, but can be longer. In some embodiments, there are two restriction sites that flank the multi-base motif region so the sequencing template need not comprise the vector region, whose sequence is already known.
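As an illustration of the motif-coverage property described above, the following Python sketch checks whether a candidate motif region contains every possible k-base motif (4^k motifs for a given k, i.e., 256, 1024, and 4096 for k = 4, 5, and 6) on either strand. It is a minimal sketch rather than a design tool; the function names and the decision to count a motif as present when it occurs on either strand are assumptions made for illustration.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq: str, k: int, circular: bool = False) -> set:
    """Collect the distinct k-base motifs present in seq."""
    if circular:
        # Wrap around so motifs spanning the plasmid origin are counted.
        seq = seq + seq[:k - 1]
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def missing_motifs(seq: str, k: int, circular: bool = False) -> set:
    """Return the k-base motifs absent from both strands of seq."""
    seq = seq.upper()
    present = kmers(seq, k, circular) | kmers(reverse_complement(seq), k, circular)
    all_motifs = {"".join(p) for p in product("ACGT", repeat=k)}
    return all_motifs - present
```

An empty result from `missing_motifs(region, 4)` would indicate that every four-base motif is represented in the candidate region; a non-empty result lists the motifs that would still need to be added.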

The plasmid is cotransfected into a host organism with an expression vector encoding a putative methyltransferase gene or other genomic sequence that might contain such a gene. The host organism is propagated and the AllMer plasmid is isolated, converted into a sequencing template, and sequenced, preferably using a sequencing technology that detects modified bases such as SMRT Sequencing. If the sequence of the AllMer plasmid is methylated, the locations of the methylated bases and flanking sequences are used to characterize the putative methyltransferase, e.g., with regard to its sequence specificity, activity, and the like. In preferred embodiments, the host organism lacks any methyltransferases, or has only well-characterized methyltransferases with recognition sequences different from those of the putative methyltransferase in the expression vector.

In some embodiments, the adaptors linked to the ends of the linearized AllMer plasmid comprise barcodes to allow multiplexed sequencing reactions from multiple host organisms. The sequences of the barcodes identify from which host organism a particular AllMer plasmid was isolated, and therefore what putative methyltransferase is responsible for the methylation detected. Since both strands of the AllMer plasmid are sequenced, strand-specific modifications are detectable. Further, in alternative embodiments, one or more additional modification-generating enzymes could be included in the host organism, e.g., within the expression vector, in a second expression vector, or expressed from the host organism's own genome. Such modification-generating enzymes can serve to enhance detection of modifications in the AllMer plasmid, e.g., by further modifying methylated bases. Such further modification includes, but is not limited to, glucosylation (e.g., by a glucosyltransferase) or hydroxylation (e.g., by a TET hydroxylase enzyme).
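The barcode-based assignment of reads to host organisms described above can be sketched as follows. This minimal Python example assumes that the barcode portion of each read has already been extracted from the adaptor sequence; the barcode sequences, host names, and the one-mismatch tolerance are hypothetical values chosen for illustration.

```python
from collections import defaultdict


def assign_host(read_barcode, barcode_to_host, max_mismatches=1):
    """Return the host whose barcode best matches, or None if ambiguous."""
    best_host = None
    best_distance = max_mismatches + 1
    tie = False
    for barcode, host in barcode_to_host.items():
        if len(barcode) != len(read_barcode):
            continue
        # Hamming distance between the observed and expected barcode.
        distance = sum(a != b for a, b in zip(barcode, read_barcode))
        if distance < best_distance:
            best_host, best_distance, tie = host, distance, False
        elif distance == best_distance and host != best_host:
            tie = True
    if tie or best_host is None:
        return None
    return best_host


def demultiplex(reads, barcode_to_host, max_mismatches=1):
    """Group read identifiers by host; reads is an iterable of (id, barcode)."""
    by_host = defaultdict(list)
    for read_id, barcode in reads:
        host = assign_host(barcode, barcode_to_host, max_mismatches)
        if host is not None:
            by_host[host].append(read_id)
    return dict(by_host)
```

Once reads are grouped in this way, the methylated positions detected in each group can be attributed to the putative methyltransferase carried by the corresponding host organism.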

In certain embodiments, rather than cotransfecting the AllMer plasmid with an expression vector, the AllMer plasmid is transfected into a host organism and used to detect base-modification activity expressed by the host organism itself. In yet further embodiments, multiple different AllMer plasmids can be used that still contain the same sets of multi-base motifs, but in detectably different conformations or orders such that sequencing reads would identify a particular AllMer plasmid. This embodiment is contemplated as an alternative to the barcoding of the adaptors, since the identifying sequences would be within the AllMer plasmid sequence rather than the adaptors. Alternatively, multiple different AllMer plasmids could be used in combination with barcoded adaptors.

XII. Methyltransferases as Analytical Tools

Methyltransferases are key to various analytical tools provided herein, some of which have been discussed above. In all of these applications, both the presence of a particular modification and the sequence context in which it occurs can be used to identify the source of a nucleic acid of interest. In some preferred embodiments, the methyltransferases used are nonspecific, i.e., they add a methyl group regardless of the sequence flanking a base to be methylated. In other words, they have no particular sequence motif that they target, although they still methylate only those bases that are sufficiently exposed to allow access by the enzyme. Further, although they typically have no “recognition sequence” per se, they are specific for a particular base to be methylated, e.g., A or C. For example, treatment of naked DNA with a nonspecific adenine methyltransferase would catalyze addition of a methyl group to most or all of the adenines within the DNA molecule.

In certain aspects, a methyltransferase is used to mark a nucleic acid so that it can be later identified and/or distinguished from other nucleic acids. For example, a nucleic acid sample directly from a source can be methylated and then subjected to amplification. Data from subsequent sequencing of the resulting pool of original nucleic acids and amplicons thereof would be able to distinguish between sequence reads from the original sample and the amplicons because the original sample would be highly methylated and the amplicons would not. One or more methyltransferases can also be cloned into an organism of interest so nucleic acids from that organism can be distinguished from nucleic acids from organisms that do not contain the methyltransferase gene. As such, “labeling” of the nucleic acids taking place in vivo allows identification of the source of the nucleic acids during a sequencing reaction. In a similar strategy, a cloned methyltransferase can be regulated by a promoter of interest, and the methylation status of nucleic acids isolated from the organism is indicative of the activation state of the promoter. Similarly, a primary nucleic acid sample can be labeled by treatment with a methyltransferase at the source to allow later identification of contaminating nucleic acids, which would not comprise the specific methylation pattern of the treated sample. This could be particularly useful for forensic applications where contamination can result in an inability to identify the original sample, or in the study of ancient nucleic acids, where contamination can result in “ancient” contigs having modern nucleic acid sequences. In related aspects, such marking can be used as a security marker to mark DNA having a proprietary origin. DNA modified using the methyltransferase is still useful for most applications, but it would carry modifications that can be used to identify its source.
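The distinction between reads from a methylated original sample and reads from its unmethylated amplicons can be sketched in a few lines of Python. The example assumes that modification detection has already produced, for each read, a list of per-site calls (True where a target base was called methylated). The 0.5 threshold and the minimum of five sites are illustrative assumptions, not values prescribed herein.

```python
def methylated_fraction(site_calls):
    """Fraction of target bases in a read that were called methylated."""
    if not site_calls:
        return 0.0
    return sum(site_calls) / len(site_calls)


def classify_read(site_calls, threshold=0.5, min_sites=5):
    """Label a read as 'original', 'amplicon', or 'undetermined'.

    threshold and min_sites are hypothetical cutoffs for illustration.
    """
    if len(site_calls) < min_sites:
        return "undetermined"
    if methylated_fraction(site_calls) >= threshold:
        return "original"
    return "amplicon"
```

The same classification could in principle be applied to flag contaminating nucleic acids, which would lack the methylation pattern of the treated sample.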

In further aspects, methyltransferases can be useful in designing constructs for molecular biology applications. For example, artificial sequences (such as adaptors, primer-binding sites, etc.) can be tagged with modified bases prior to linking them with a nucleic acid of interest (“target”). Subsequent sequencing that detects the modified bases uses that information, in combination with the sequence data, to identify the artificial sequences and to distinguish them from the target nucleic acid. Addition of modifications to either artificial or sample nucleic acids can also serve a protective function where it makes the nucleic acid unrecognizable to an enzyme that would otherwise alter it. For example, the modifications may prevent internal cleavage by an endonuclease, degradation at a free end by an exonuclease, ligation to an exogenous sequence (e.g., used to remove “non-target” nucleic acids from a mixture), addition of further modifications (e.g., DNA damage), etc. In a specific implementation, a potentially dangerous nucleic acid sequence, such as from a pathogenic organism, is treated to prevent its expression if it were to be accidentally released, inhaled, injected, or ingested. For example, a viral nucleic acid can be rendered biologically incompetent for integration into a host genome.

In yet further aspects, methyltransferases can be used to modify genomic function in various ways. For example, introducing modifications into the genomes of pathogenic organisms, e.g., bacteria, fungi, or viruses, can be a strategy for inactivating the genome of the organism or otherwise altering the life/cell cycle, where enzymes needed for such processes are sensitive to such modifications. For example, cancer cells can be targeted by an expression vector (e.g., a plasmid or virus) carrying the desired methyltransferase, and once inside the cancer cells the methyltransferase is expressed and methylates the genome of the cancer cell, inactivating it or otherwise causing programmed cell death. In another example, a pathogenic bacterium can be targeted, “infected” with a methyltransferase, and the resulting changes in expression patterns either slow or stop the growth of the bacterium and allow clearance from the infected individual. Alternatively, growth of a beneficial bacterium could be enhanced by a similar method, but where the modification serves to increase efficiency of nutrient uptake, speed the cell cycle, or the like.

In a more subtle implementation, targeted modifications to genomic regions associated with transcriptional regulation, e.g., promoters, termination sequences, etc., serve to modulate transcription of a gene or genes of interest. For example, fusion of a methyltransferase or other modification-introducing gene product with a locus-specific protein (or nucleic acid) can allow targeting of the resulting modifications to that locus in a genome or other nucleic acid molecule. For example, a transcription factor specific for a particular gene or gene family binds to the promoter sequence(s) and brings the attached modification enzyme into close proximity where it can introduce modifications to the promoter region and/or gene(s). Such modifications alter expression of the gene(s), and can serve to increase expression, decrease expression, or entirely silence the gene. In a specific embodiment, a methyltransferase gene can be integrated into a host organism under the control of an externally activatable promoter. This functions as an artificial apoptosis mechanism, allowing the practitioner to activate the expression and induce cell death at will.

XIII. Systems

The invention also provides systems that are used in conjunction with the compositions and methods of the invention in order to provide for real-time single-molecule detection of analytical reactions. In particular, such systems typically include the reagent systems described herein, in conjunction with an analytical system, e.g., for detecting data from those reagent systems. In certain preferred embodiments, analytical reactions are monitored using an optical system capable of detecting and/or monitoring interactions between reactants at the single-molecule level. For example, such an optical system can achieve these functions by first generating and transmitting an incident wavelength to the reactants, followed by collecting and analyzing the optical signals from the reactants. Such systems typically employ an optical train that directs signals from the reactions to a detector, and in certain embodiments in which a plurality of reactions is disposed on a solid surface, such systems typically direct signals from the solid surface (e.g., array of confinements) onto different locations of an array-based detector to simultaneously detect multiple different optical signals from each of multiple different reactions. In particular, the optical trains typically include optical gratings or wedge prisms to simultaneously direct and separate signals having differing spectral characteristics from each confinement in an array to different locations on an array-based detector, e.g., a CCD, and may also comprise additional optical transmission elements and optical reflection elements.

An optical system applicable for use with the present invention preferably comprises at least an excitation source and a photon detector. The excitation source generates and transmits incident light used to optically excite the reactants in the reaction. Depending on the intended application, the source of the incident light can be a laser, a laser diode, a light-emitting diode (LED), an ultraviolet light bulb, and/or a white light source. Further, the excitation light may be evanescent light, e.g., as in total internal reflection microscopy, certain types of waveguides that carry light to a reaction site (see, e.g., U.S. Application Pub. Nos. 20080128627, 20080152281, and 200801552280), or zero mode waveguides, described below. Where desired, more than one source can be employed simultaneously. The use of multiple sources is particularly desirable in applications that employ multiple different reagent compounds having differing excitation spectra, consequently allowing detection of more than one fluorescent signal to track the interactions of more than one molecule or type of molecule simultaneously (e.g., multiple types of differentially labeled reaction components). A wide variety of photon detectors or detector arrays are available in the art. Representative detectors include but are not limited to an optical reader, a high-efficiency photon detection system, a photodiode (e.g., avalanche photodiodes (APD)), a camera, a charge-coupled device (CCD), an electron-multiplying charge-coupled device (EMCCD), an intensified charge-coupled device (ICCD), and a confocal microscope equipped with any of the foregoing detectors. For example, in some embodiments an optical train includes a fluorescence microscope capable of resolving fluorescent signals from individual sequencing complexes. 
Where desired, the subject arrays of optical confinements contain various alignment aids or keys to facilitate a proper spatial placement of the optical confinement and the excitation sources, the photon detectors, or the optical train as described below.

The subject optical system may also include an optical train whose function can be manifold and may comprise one or more optical transmission or reflection elements. Such optical trains preferably encompass a variety of optical devices that channel light from one location to another in either an altered or unaltered state. First, the optical train collects and/or directs the incident wavelength to the reaction site (e.g., optical confinement). Second, it transmits and/or directs the optical signals emitted from the reactants to the photon detector. Third, it may select and/or modify the optical properties of the incident wavelengths or the emitted wavelengths from the reactants. Illustrative examples of such optical transmission or reflection elements are diffraction gratings, arrayed waveguide gratings (AWG), optical fibers, optical switches, mirrors (including dichroic mirrors), lenses (including microlenses, nanolenses, objective lenses, imaging lenses, and the like), collimators, optical attenuators, filters (e.g., polarization or dichroic filters), prisms, wavelength filters (low-pass, band-pass, or high-pass), planar waveguides, wave-plates, delay lines, and any other devices that guide the transmission of light through proper refractive indices and geometries. One example of a particularly preferred optical train is described in U.S. Patent Pub. No. 20070036511, filed Aug. 11, 2005, and incorporated by reference herein in its entirety for all purposes.

In a preferred embodiment, a reaction site (e.g., optical confinement) containing a reaction of interest is operatively coupled to a photon detector. The reaction site and the respective detector can be spatially aligned (e.g., 1:1 mapping) to permit an efficient collection of optical signals from the reactants. In certain preferred embodiments, a reaction substrate is disposed upon a translation stage, which is typically coupled to appropriate robotics to provide lateral translation of the substrate in two dimensions over a fixed optical train. Alternative embodiments could couple the translation system to the optical train to move that aspect of the system relative to the substrate. For example, a translation stage provides a means of removing a reaction substrate (or a portion thereof) out of the path of illumination to create a non-illuminated period for the reaction substrate (or a portion thereof), and returning the substrate at a later time to initiate a subsequent illuminated period. An exemplary embodiment is provided in U.S. Patent Pub. No. 20070161017, filed Dec. 1, 2006.

In particularly preferred aspects, such systems include arrays of reaction regions, e.g., zero mode waveguide arrays, that are illuminated by the system, in order to detect signals (e.g., fluorescent signals) therefrom, that are in conjunction with analytical reactions being carried out within each reaction region. Each individual reaction region can be operatively coupled to a respective microlens or a nanolens, preferably spatially aligned to optimize the signal collection efficiency. Alternatively, a combination of an objective lens, a spectral filter set or prism for resolving signals of different wavelengths, and an imaging lens can be used in an optical train, to direct optical signals from each confinement to an array detector, e.g., a CCD, and concurrently separate signals from each different confinement into multiple constituent signal elements, e.g., different wavelength spectra, that correspond to different reaction events occurring within each confinement. In preferred embodiments, the setup further comprises means to control illumination of each confinement, and such means may be a feature of the optical system or may be found elsewhere in the system, e.g., as a mask positioned over an array of confinements. Detailed descriptions of such optical systems are provided, e.g., in U.S. Patent Pub. No. 20060063264, filed Sep. 16, 2005, which is incorporated herein by reference in its entirety for all purposes. While optical systems such as those described above are a preferred detection system, other detection systems known to those of skill in the art are also contemplated, in particular, systems that detect passage of molecules through pores in a membrane or other surface, e.g., nanopores; systems that detect charge-based changes; systems that detect pH-based changes; systems that detect electrical changes; and the like.

The systems of the invention typically include information processors or computers operably coupled to the detection portions of the systems, in order to store the signal data obtained from the detector(s) on a computer readable medium, e.g., hard disk, CD, DVD or other optical medium, flash memory device, or the like. For purposes of this aspect of the invention, such operable connection provides for the electronic transfer of data from the detection system to the processor for subsequent analysis and conversion. Operable connections may be accomplished through any of a variety of well-known computer networking or connecting methods, e.g., Firewire®, USB connections, wireless connections, WAN or LAN connections, or other connections that preferably include high data transfer rates. The computers also typically include software that analyzes the raw signal data, identifies signal pulses that are likely associated with incorporation events, and identifies bases incorporated during the sequencing reaction, in order to convert or transform the raw signal data into user interpretable sequence data (see, e.g., Published U.S. Patent Application No. 2009-0024331, the full disclosure of which is incorporated herein by reference in its entirety for all purposes).

Exemplary systems are described in detail in, e.g., U.S. patent application Ser. No. 11/901,273, filed Sep. 14, 2007 and U.S. patent application Ser. No. 12/134,186, filed Jun. 5, 2008, the full disclosures of which are incorporated herein by reference in their entireties for all purposes.

Further, the invention provides data processing systems for transforming raw data generated in an analytical reaction into analytical data that provides a measure of one or more aspects of the reaction under investigation, e.g., transforming signals from a sequencing-by-synthesis reaction or pore-translation reaction into nucleic acid sequence read data, which can then be transformed into consensus sequence data. In certain embodiments, the data processing systems include machines for generating nucleic acid sequence read data by polymerase-mediated processing of a template nucleic acid molecule (e.g., DNA or RNA). The nucleic acid sequence read data generated by such a reaction is representative of the nucleic acid sequence of the nascent polynucleotide synthesized by a polymerase translocating along a nucleic acid template only to the extent that a given sequencing technology is able to generate such data, and so may not be identical to the actual sequence of the nascent polynucleotide molecule. For example, it may contain a deletion or a different nucleotide at a given position as compared to the actual sequence of the polynucleotide, e.g., when a nucleotide incorporation is missed or incorrectly determined, respectively. As such, it is beneficial to generate redundant nucleic acid sequence read data, and to transform the redundant nucleic acid sequence read data into consensus nucleic acid sequence data that is generally more representative of the actual sequence of the polynucleotide molecule than nucleic acid sequence read data from a single read of the nucleic acid molecule. Redundant nucleic acid sequence read data comprises multiple reads, each of which includes at least a portion of nucleic acid sequence read that overlaps with at least a portion of at least one other of the multiple nucleic acid sequence reads. 
As such, the multiple reads need not all overlap with one another, and a first subset may overlap for a different portion of the nucleic acid sequence than does a second subset. Such redundant sequence read data can be generated by various methods, including repeated synthesis of nascent polynucleotides from a single nucleic acid template, synthesis of polynucleotides from multiple identical nucleic acid templates, or a combination thereof.

In another aspect, the data processing systems can include software and algorithm implementations provided herein, e.g., those configured to transform redundant nucleic acid sequence read data into consensus nucleic acid sequence data, which, as noted above, is generally more representative of the actual sequence of the nascent polynucleotide molecule than nucleic acid sequence read data from a single read of a single nucleic acid molecule. Further, the transformation of the redundant nucleic acid sequence read data into consensus nucleic acid sequence data identifies and negates some or all of the single-read variation between the multiple reads in the redundant nucleic acid sequence read data. As such, the transformation provides a representation of the actual nucleic acid sequence of the nascent polynucleotide complementary to the nucleic acid template that is more accurate than a representation based on a single read.
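One simple way to picture the transformation of redundant read data into consensus data is a column-wise majority vote, sketched below in Python. This is an illustrative simplification and not the algorithm of the incorporated references: it assumes the reads have already been aligned to a common coordinate system, with `-` marking a deletion in a read and `.` marking positions a read does not cover (both symbols are conventions chosen for this sketch).

```python
from collections import Counter


def consensus(aligned_reads, gap="-", uncovered="."):
    """Majority-vote consensus over pre-aligned reads of a common template."""
    if not aligned_reads:
        return ""
    length = max(len(read) for read in aligned_reads)
    result = []
    for i in range(length):
        # Gather the calls from every read that covers this column.
        column = [read[i] for read in aligned_reads
                  if i < len(read) and read[i] != uncovered]
        if not column:
            result.append("N")  # no read covers this position
            continue
        call, _ = Counter(column).most_common(1)[0]
        if call != gap:         # a majority gap means the base is dropped
            result.append(call)
    return "".join(result)
```

In this sketch, a substitution or missed incorporation present in only one of several overlapping reads is outvoted by the remaining reads, which mirrors how single-read variation is negated in the consensus. Reads that cover different portions of the template contribute only to the columns they span.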

Various methods and algorithms for data transformation employ data analysis techniques that are familiar in a number of technical fields, and are generally referred to herein as statistical analysis. For clarity of description, details of known techniques are not provided herein. These techniques are discussed in a number of available reference works, such as those provided in U.S. Pat. Nos. 8,182,993 and 8,370,079; U.S. Patent Publication No. 2012/0330566; U.S. Ser. No. 13/468,347, filed May 10, 2012; U.S. Ser. No. [unknown], Attorney Docket No. 01-015901, Chin, et al., entitled “Hierarchical Genome Assembly Method Using Single Long Insert Library,” filed Mar. 14, 2013; U.S. Ser. No. 61/671,554, filed Jul. 13, 2012; and U.S. Ser. No. 61/116,439, filed Nov. 20, 2008, the disclosures of which are incorporated herein by reference in their entireties for all purposes.

The software and algorithm implementations provided herein are preferably machine-implemented methods, e.g., carried out on a machine comprising computer-readable medium configured to carry out various aspects of the methods herein. For example, the computer-readable medium preferably comprises one or more of the following: a) a user interface; b) memory for storing raw analytical reaction data; c) memory storing software-implemented instructions for carrying out the algorithms for transforming the raw analytical reaction data into transformed data that characterizes one or more aspects of the reaction (e.g., rate, consensus sequence data, etc.); d) a processor for executing the instructions; e) software for recording the results of the transformation into memory; and f) memory for recordation and storage of the transformed data. In preferred embodiments, the user interface is used by the practitioner to manage various aspects of the machine, e.g., to direct the machine to carry out the various steps in the transformation of raw data into transformed data, recordation of the results of the transformation, and management of the transformed data stored in memory.

As such, in preferred embodiments, the methods further comprise a transformation of the computer-readable medium by recordation of the raw analytical reaction data and/or the transformed data generated by the methods. Further, the computer-readable medium may comprise software for providing a graphical representation of the raw analytical reaction data and/or the transformed data, and the graphical representation may be provided, e.g., in soft-copy (e.g., on an electronic display) and/or hard-copy (e.g., on a print-out) form.

The invention also provides a computer program product comprising a computer-readable medium having a computer-readable program code embodied therein, the computer-readable program code adapted to implement one or more of the methods described herein, and optionally also providing storage for the results of the methods of the invention. In certain preferred embodiments, the computer program product comprises the computer-readable medium described above.

In another aspect, the invention provides data processing systems for transforming raw analytical reaction data from one or more analytical reactions into transformed data representative of a particular characteristic of an analytical reaction, e.g., an actual sequence of one or more template nucleic acids analyzed, a rate of an enzyme-mediated reaction, an identity of a kinase target molecule, and the like. Such data processing systems typically comprise a computer processor for processing the raw data according to the steps and methods described herein, and computer usable medium for storage of the raw data and/or the results of one or more steps of the transformation, such as the computer-readable medium described above.

As shown in FIG. 3, the system 300 includes a substrate 302 that includes a plurality of discrete sources of chromophore emission signals, e.g., an array of zero mode waveguides 304. An excitation illumination source, e.g., laser 306, is provided in the system and is positioned to direct excitation radiation at the various signal sources. This is typically done by directing excitation radiation at or through appropriate optical components, e.g., dichroic 308 and objective lens 310, that direct the excitation radiation at the substrate 302, and particularly the signal sources 304. Emitted signals from the sources 304 are then collected by the optical components, e.g., objective 310, and passed through additional optical elements, e.g., dichroic 308, prism 312 and lens 314, until they are directed to and impinge upon an optical detection system, e.g., detector array 316. The signals are then detected by detector array 316, and the data from that detection is transmitted to an appropriate data processing system, e.g., computer 318, where the data is subjected to interpretation and analysis, and ultimately presented in a user-ready format, e.g., on display 320, or printout 322, from printer 324. As will be appreciated, a variety of modifications may be made to such systems, including, for example, the use of multiplexing components to direct multiple discrete beams at different locations on the substrate, the use of spatial filter components, such as confocal masks, to filter out-of-focus components, beam shaping elements to modify the spot configuration incident upon the substrates, and the like (See, e.g., Published U.S. Patent Application Nos. 2007/0036511 and 2007/095119, and U.S. patent application Ser. No. 11/901,273, all of which are incorporated herein by reference in their entireties for all purposes.)

XIV. Example Analysis of Bacterial Methylomes

Sequencing template libraries were prepared as previously described (Clark, et al. (2012) Nucleic Acids Res., 40, e29; and Travers, et al. (2010) Nucleic Acids Res. 38, e159). Briefly, genomic DNA samples were sheared to an average size of ~800 base pairs via adaptive focused acoustics (Covaris; Woburn, Mass., USA), end repaired, and ligated to stem-loop adapters to form SMRTbell™ templates. Incompletely formed SMRTbell templates were digested with a combination of Exonuclease III (New England Biolabs; Ipswich, Mass., USA) and Exonuclease VII (Affymetrix; Cleveland, Ohio, USA). SMRT® Sequencing was carried out on the PacBio® RS (Pacific Biosciences; Menlo Park, Calif., USA) using standard protocols for small insert SMRTbell libraries. All restriction endonucleases except Eco147I (Fermentas; Glen Burnie, MD), Phusion-HF DNA polymerase, Antarctic Phosphatase, T4-DNA ligase, and E. coli competent cells were from New England Biolabs Inc. (Ipswich, Mass., USA). Synthetic oligonucleotides were purchased from Integrated DNA Technologies (Coralville, Iowa, USA). Geobacter metallireducens GS-15 ATCC 53774 DNA, Chromohalobacter salexigens DSM 3043 DNA and Bacillus cereus ATCC 10987 DNA were obtained from the culture collections indicated. Vibrio breoganii 1C-10 DNA was a gift from Martin Polz, MIT. Campylobacter jejuni subsp. jejuni 81-176 and Campylobacter jejuni NCTC 11168 DNAs were a gift from Stuart Thompson, Medical College of Georgia.

Sequencing reads were processed and mapped to the respective reference sequences using the BLASR mapper (http://www[dot]pacbiodevnet[dot]com/SMRT-Analysis/Algorithms/BLASR) and Pacific Biosciences' SMRT Analysis pipeline (http://www[dot]pacbiodevnet[dot]com/SMRT-Analysis/Software/SMRT-Pipe) using the standard mapping protocol. Interpulse durations (IPDs) were measured as previously described (Flusberg, et al. (2010) Nat. Methods, 7, 461-465) and processed as described (Clark, et al., supra) for all pulses aligned to each position in the reference sequence. Modified positions were identified using Pacific Biosciences' SMRTPortal analysis platform, v. 1.3.1, which uses an in silico kinetic reference and a t-test-based kinetic score for detection of modified base positions (details are available at http://www[dot]pacb[dot]com/pdf/TN_Detecting_DNA_Base_Modifications.pdf). Methyltransferase target sequence motifs were identified by selecting the top one thousand kinetic hits and subjecting a +/−20 base window around the detected base to MEME-ChIP (Machanick, P. and Bailey, T. L. (2011) Bioinformatics. 27, 1696-7). To measure the extent of methylation for each motif in a genome, a kinetic score threshold was chosen such that 1% of the detected signals were not assigned to any MTase recognition motifs (5% for B. cereus to accommodate the lower signal intensities for 4-methylcytosine). We subjected this 1% population of sequence contexts to another round of MEME-ChIP analysis to confirm the absence of any additional consensus motifs. No accumulation of motifs that harbored similarities to the identified active motifs was observed. All kinetic data files have been deposited in GEO (Accession #: GSE40133) (Sayers, et al. (2012) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 40 (Database issue), D13-25) (http://www[dot]ncbi[dot]nlm[dot]nih[dot]gov/geo/summary/).
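
By way of non-limiting illustration, the window extraction and score threshold selection described above can be sketched as follows. This is a minimal sketch under stated assumptions and is not the SMRT Analysis or MEME-ChIP implementation: the function names, the (position, score) layout for kinetic hits, and the representation of a motif as a regular expression plus the offset of its modified base are all hypothetical, and the sketch treats a single strand of a linear reference.

```python
import re


def top_hit_windows(reference, hits, n_top=1000, flank=20):
    """Return +/- `flank` base windows around the `n_top` best kinetic hits.

    `hits` is a list of (position, score) pairs with 0-based positions
    (a hypothetical layout chosen for this sketch).
    """
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)[:n_top]
    windows = []
    for pos, _score in ranked:
        start = max(0, pos - flank)                 # clip at the reference start
        end = min(len(reference), pos + flank + 1)  # clip at the reference end
        windows.append(reference[start:end])
    return windows


def in_motif(reference, pos, motifs):
    """True if `pos` is the modified base of an occurrence of any motif.

    `motifs` is a list of (regex, offset) pairs, where `offset` is the index
    of the modified base within the motif (hypothetical representation).
    """
    for pattern, offset in motifs:
        start = pos - offset
        if start >= 0 and re.compile(pattern).match(reference, start):
            return True
    return False


def choose_threshold(hits, assigned, max_unassigned=0.01):
    """Lowest score cutoff at which the fraction of hits at or above the
    cutoff that are NOT in `assigned` stays within `max_unassigned`.

    `assigned` is the set of positions explained by a known motif.
    Returns None if no cutoff satisfies the limit.
    """
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    best = None
    unassigned = 0
    for i, (pos, score) in enumerate(ranked, start=1):
        if pos not in assigned:
            unassigned += 1
        # Only evaluate a cutoff once every hit sharing this score is counted.
        last_of_tie = i == len(ranked) or ranked[i][1] != score
        if last_of_tie and unassigned / i <= max_unassigned:
            best = score
    return best


if __name__ == "__main__":
    ref = "TTGATCTTTTGATCTT"
    hits = [(3, 50.0), (11, 45.0), (7, 10.0)]
    motifs = [("GATC", 1)]  # modified base is the A of GATC
    assigned = {p for p, _ in hits if in_motif(ref, p, motifs)}
    print(top_hit_windows(ref, hits, n_top=2, flank=2))
    print(choose_threshold(hits, assigned))
```

In such a sketch, `max_unassigned` would be set to 0.01 for most genomes and to 0.05 where weaker signals are expected, mirroring the 1% and 5% figures given above. The windows returned by `top_hit_windows` would then be written out in whatever format the motif-finding tool requires.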

The SEQWARE computer resource was used to identify restriction-modification system genes from the complete genome sequences of G. metallireducens GS-15 (GenBank # CP000148 and CP000149), C. salexigens (GenBank # CP000285), B. cereus (GenBank # AE017194 and AE017195), C. jejuni subsp. jejuni 81-176 (GenBank # CP000538, CP000549 and CP000550), C. jejuni NCTC 11168 (GenBank # AL111168) and V. breoganii 1C-10 (GenBank # AKXW00000000). Software modules combined with internal databases constitute the SEQWARE resource. New sequence data was scanned locally for homologs of already identified and annotated restriction-modification systems in REBASE (Roberts, R. J., et al. (2010) Nucleic Acids Res., 38, D234-D236). Sequence similarity from BLAST searches, the presence of predictive functional motifs (Posfai, et al. (1989) Nucleic Acids Res., 17, 2421-2435; and Klimasauskas, et al. (1989) Nucleic Acids Res., 17, 9823-9832) and genomic context are the basic indicators of potential new restriction-modification system components. Heuristic rules, derived from knowledge about the gene structure of restriction-modification systems, are also applied to refine the hits. Attempts are made to avoid false hits caused by strong sequence similarity of RNA and protein methyltransferases, or hits based solely on non-specific domains of restriction-modification enzymes, such as helicase or chromatin remodeling domains. SEQWARE then localizes motifs and domains, assigns probable recognition specificities, classifies accepted hits and marks Pfam relationships. All candidates are then inspected manually before being assigned as part of a restriction-modification system. The results are entered into REBASE.

Selected MTase genes were amplified from bacterial genomic DNA with Phusion-HF DNA polymerase and cloned into the plasmid pRRS as described previously (Clark, et al., supra). When no suitable sites were present elsewhere in the construct, restriction sites diagnostic for the predicted methylation pattern were incorporated into the 3′-end oligonucleotides. The presence or absence of specific methylation was determined by digesting the constructs with appropriate restriction enzymes. Host strains used for cloning included E. coli ER2796 (Kong, H., et al. (2000) Nucleic Acids Res., 28, 3216-3223) and E. coli ER2683 (Sibley, M. H. and Raleigh, E. A. (2004) Nucleic Acids Res., 32, 522-534).

The Csa_1401 and Gmet_0255 genes were cloned into a plasmid using the Gibson assembly technique (Gibson, et al. (2009) Nat. Methods 6, 343-345). The plasmid vector was PCR-amplified, and the MTase genes were amplified using primers having 5′ tails that overlap with the ends of the amplified vector. PCR-amplified DNAs were purified over a Qiagen spin column. 0.1 pmol vector was combined with 0.3 pmol MTase gene insert in 20 μl 1× Gibson assembly reaction (New England Biolabs) and incubated at 50° C. for 1 hour. 2 μl of this assembled construct was transformed into 50 μl chemically competent E. coli ER2796 cells and plated on LB-ampicillin plates at 37° C. overnight.
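
As a worked illustration of the molar quantities recited above, the mass of double-stranded DNA corresponding to a given number of picomoles can be estimated from the fragment length. The sketch below assumes an average mass of about 650 g/mol per base pair, which is a common approximation rather than a value taken from the study. The helper names, fragment lengths and stock concentration in the usage example are hypothetical.

```python
AVG_BP_MASS = 650.0  # g/mol per base pair of dsDNA (approximation, assumed)


def pmol_to_ng(pmol, length_bp, bp_mass=AVG_BP_MASS):
    """Mass in ng of `pmol` picomoles of a dsDNA fragment of `length_bp`.

    pmol * (g/mol) gives picograms; dividing by 1000 converts to nanograms.
    """
    return pmol * length_bp * bp_mass / 1000.0


def volume_ul(mass_ng, conc_ng_per_ul):
    """Volume in microliters of a stock needed to deliver `mass_ng`."""
    return mass_ng / conc_ng_per_ul


if __name__ == "__main__":
    # Hypothetical lengths: a 3000 bp vector and a 1500 bp MTase gene insert.
    vector_ng = pmol_to_ng(0.1, 3000)
    insert_ng = pmol_to_ng(0.3, 1500)
    print(f"vector: {vector_ng:.1f} ng, insert: {insert_ng:.1f} ng")
    # Hypothetical stock concentration of 50 ng/ul for the purified vector.
    print(f"vector volume: {volume_ul(vector_ng, 50.0):.2f} ul")
```

The 0.3 pmol to 0.1 pmol quantities correspond to a 3:1 molar ratio of insert to vector. Because mass scales with length, the same molar amounts translate into different masses for fragments of different sizes.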

This study represented one of the first times that the complete methylation pattern of a bacterial genome had been examined. Of the methyltransferases studied, seven are components of Type I restriction-modification systems and have six different recognition sequences, all of which were new. In addition, two Type III systems were found with one new recognition sequence, and two methyltransferases were part of traditional Type II systems, although the activity of the corresponding restriction endonuclease was not confirmed. Finally, four Type IIG restriction endonucleases, which contain both methyltransferase and restriction endonuclease activity in a single protein, were found, all with new specificities.

Despite the recognized importance of methylation for understanding fundamental microbiological processes, microbe adaptability and disease pathogenicity (Srikhanta, et al. (2010) Nat. Rev. Microbiol., 8, 196-206; and Casadesús, et al. (2006) Microbiol. Mol. Biol. Rev., 70, 830-56), there has not been a great deal of research into the methylation patterns of bacterial genomes, largely because of the difficulty of obtaining suitable data. One area where knowledge about the methylome is very important relates to studies trying to transform DNA into strains that contain one or more restriction-modification systems that vastly reduce transformation efficiencies. In some cases these barriers have been overcome by premethylating the DNA, or by removing the restriction-modification systems from strains (Donahue, et al. (2000) Mol. Microbiol. 37, 1066-1074; and Dong, et al. (2010) PLoS ONE 5, e9038). One problem with the latter approach is that removal of methylation systems may fundamentally change the biology of the organism under study. With the kind of analysis provided here, the restriction-modification systems likely to cause problems with transformation can be easily spotted and appropriate measures taken. Thus, the methyltransferases necessary for protection can be identified and, if needed, intermediate cloning hosts carrying suitable complements of methyltransferase genes can be prepared.

The results provided here show that SMRT® sequencing can provide functional information about active methyltransferases present in genomes and can decipher their recognition sequences, a task that used to be so time-consuming that it was not usually carried out. This capability, combined with the long reads provided by this technology, can be an excellent adjunct to current high-throughput sequencing platforms, in that sequence assembly is facilitated and gene function is reliably documented. Further details of this study are provided in Murray, et al. (2012) Nucleic Acids Res. 40(22):11450-62, which is incorporated herein by reference in its entirety for all purposes.

It is to be understood that the above description is intended to be illustrative and not restrictive. It should be readily apparent to one skilled in the art that various embodiments and modifications may be made to the invention disclosed in this application without departing from the scope and spirit of the invention. For example, all the techniques and apparatus described above can be used in various combinations. The scope of the invention should, therefore, be determined not with reference to the above description, but should instead be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. All publications mentioned herein are cited for the purpose of describing and disclosing reagents, methodologies and concepts that may be used in connection with the present invention. Nothing herein is to be construed as an admission that these references are prior art in relation to the inventions described herein. All publications, patents, patent applications, and/or other documents cited in this application are incorporated by reference in their entirety for all purposes to the same extent as if each individual publication, patent, patent application, and/or other document were individually and separately indicated to be incorporated by reference for all purposes.

1-51. (canceled)
 52. A method of generating a haplotype, the method comprising: a) providing a genomic DNA sample comprising a DNA fragment of at least 2 kb in length, wherein the DNA fragment comprises a region of interest; b) subjecting the DNA fragment to a single-molecule sequencing reaction to generate a single sequence read that extends the full length of the DNA fragment comprising the region of interest, wherein the single sequence read comprises sequence information and kinetic information, wherein the kinetic information is indicative of modified bases within the DNA fragment; c) analyzing the sequence information in the single sequence read to determine a base sequence for the region of interest; d) analyzing the kinetic information in the single sequence read to identify modified bases within the region of interest; and e) phasing the base sequence and the modified bases in the region of interest, thereby generating a haplotype that comprises both the base sequence of the region of interest and modified bases within the region of interest.
 53. The method of claim 52, wherein a plurality of fragments comprising the region of interest are each subjected to a single-molecule sequencing reaction within a single reaction mixture to generate a plurality of single sequence reads, and the analyzing further comprises generating a plurality of haplotypes for the region of interest.
 54. The method of claim 53, further comprising constructing a consensus haplotype sequence from the plurality of haplotypes.
 55. The method of claim 52, wherein the genomic DNA sample comprises one or more genomic DNAs selected from the group consisting of a human genomic DNA, a bacterial genomic DNA, a viral genomic DNA, a fungal genomic DNA, a forensic genomic DNA, a patient's genomic DNA, a diagnostic genomic DNA, a prognostic genomic DNA, a stem cell genomic DNA, a pluripotent cell genomic DNA, a parasitic organism genomic DNA, an embryonic genomic DNA, and a cancer cell genomic DNA.
 56. The method of claim 52, wherein the genomic DNA sample comprises both a maternal DNA fragment and a fetal DNA fragment from a single maternal blood sample, and further wherein a first haplotype is generated for the region of interest in the maternal DNA fragment and a second haplotype is generated for the region of interest in the fetal DNA fragment.
 57. The method of claim 52, wherein the genomic DNA sample comprises both a non-tumor-derived DNA fragment and a tumor-derived DNA fragment from a single patient sample, and wherein a first haplotype is generated for the region of interest in the non-tumor-derived DNA fragment and a second haplotype is generated for the region of interest in the tumor-derived DNA fragment, and further wherein differences between the first haplotype and the second haplotype correspond to mutations that occurred during development of a tumor.
 58. The method of claim 52, wherein the genomic DNA sample comprises a pair of homologous chromosomes, the pair comprising a first homolog and a second homolog, and further wherein a first haplotype is generated for the region of interest in the first homolog and a second haplotype is generated for the region of interest in the second homolog.
 59. The method of claim 52, further comprising isolating the genomic DNA sample from a biological source selected from the group consisting of a blood sample, a sputum sample, a urine sample, a nasopharyngeal sample, a vaginal sample, a biopsy, a forensic sample, a buccal sample, a patient sample, diagnostic sample, prognostic sample, embryonic sample, cancer genomic sample, a metagenomic sample, and a colonic sample.
 60. The method of claim 59, wherein the biological source is a metagenomic sample comprising a plurality of different types of microorganisms.
 61. The method of claim 60, wherein the microorganisms comprise at least one microbe selected from the group consisting of bacteria, archaea, fungi, viruses, and protozoa.
 62. The method of claim 60, wherein at least one of the microorganisms is a pathogenic organism.
 63. The method of claim 52, wherein the genomic DNA sample is not amplified, bisulfite-converted, or cloned prior to said single-molecule sequencing reaction.
 64. The method of claim 52, wherein the single-molecule sequencing reaction is selected from sequencing-by-incorporation, tSMS sequencing, and nanopore sequencing.
 65. The method of claim 52, wherein the single-molecule sequencing reaction has average read lengths in the 1000 to 10,000 base pair range.
 66. The method of claim 52, wherein the kinetic information in the single sequence read is indicative of patterns of modified bases within the DNA fragment.
 67. The method of claim 52, wherein the modified bases in the region of interest comprise at least one of 6-mA, 4-mC, 5-mC, hmC, phosphorothioate, and glucosylated hmC.
 68. The method of claim 52, wherein the region of interest is selected from the group consisting of a highly repetitive region, a euchromatic region, a heterochromatic region, an imprinted region, a pseudogene, a gene, and a regulatory region.
 69. The method of claim 52, wherein the read length of the single sequence read is at least 3000 base pairs in length.
 70. The method of claim 52, further comprising converting the genomic DNA sample into a sequencing library prior to said single-molecule sequencing reaction.
 71. The method of claim 52, wherein the genomic DNA sample comprises a gene and a pseudogene, and wherein a first haplotype is generated for the region of interest in the gene and a second haplotype is generated for the region of interest in the pseudogene, and further wherein the first haplotype and second haplotype comprise different patterns of modified bases.